Preface


No applied mathematician can be properly trained without some basic un- 
derstanding of numerical methods, i.e., numerical analysis. And no scientist 
and engineer should be using a package program for numerical computa- 
tions without understanding the program’s purpose and its limitations. 
This book is an attempt to provide some of the required knowledge and 
understanding. It is written in a spirit that considers numerical analysis 
not merely as a tool for solving applied problems but also as a challenging 
and rewarding part of mathematics. The main goal is to provide insight 
into numerical analysis rather than merely to provide numerical recipes. 

The book evolved from the courses on numerical analysis I have taught 
since 1971 at the University of Göttingen and may be viewed as a successor
of an earlier version jointly written with Bruno Brosowski [10] in 1974. It 
aims at presenting the basic ideas of numerical analysis in a style as concise 
as possible. Its volume is scaled to a one-year course, i.e., a two-semester 
course, addressing second-year students at a German university or advanced 
undergraduate or first-year graduate students at an American university. 

In order to make the book accessible not only to mathematicians but 
also to scientists and engineers, I have planned it to be as self-contained as 
possible. As prerequisites it requires only a solid foundation in differential 
and integral calculus and in linear algebra as well as an enthusiasm to see 
these fundamental and powerful tools in action for solving applied prob- 
lems. A short presentation of some basic functional analysis is provided in 
the book to the extent required for a modern presentation of numerical 
analysis and a deeper understanding of the subject. 


An introductory book of a few hundred pages cannot completely cover 
all classical aspects of numerical analysis and all of the more recent devel- 
opments. I am willing to admit that the choice of some of the topics in the 
present volume is biased by my own preferences and that some important 
subjects are omitted. 

I was taught numerical analysis in the mid sixties by my thesis adviser, 
Professor Erich Martensen, at the Technische Hochschule in Darmstadt. 
Martensen’s perspective on teaching mathematics in general and numeri- 
cal analysis in particular had a great and long-lasting impact on my own 
teaching. Therefore, this book is dedicated to Erich Martensen on the oc- 
casion of his seventieth birthday. 

I would like to thank Thomas Gerlach and Peter Otte for carefully read- 
ing the book, for checking the solutions to the problems, and for a number 
of suggestions for improvements. Special thanks are given to my friend 
David Colton for reading over the book for correct use of the English lan- 
guage. Part of the book was written while I was on sabbatical leave at the 
Department of Mathematical Sciences at the University of Delaware and 
the Department of Mathematics at the University of New South Wales. I 
gratefully acknowledge the hospitality of these institutions. I also am grate- 
ful to Springer-Verlag for being willing to take the economic risk of adding 
yet another volume to the already huge number of existing introductions 
to numerical analysis. 


Göttingen, September 1997 Rainer Kress
Glossary of Symbols


Sets and Spaces 


$\mathbb{N}$                 set of natural numbers
$\mathbb{Z}$                 set of integers
$\mathbb{R}$                 set of real numbers
$\mathbb{C}$                 set of complex numbers
$|x|$                        absolute value of a real or complex number $x$
$(a, b)$                     open interval $(a,b) := \{x \in \mathbb{R} : a < x < b\}$
$[a, b]$                     closed interval $[a,b] := \{x \in \mathbb{R} : a \le x \le b\}$
$\bar{z}$                    conjugate of a complex number $z$
$\mathbb{R}^n$               n-dimensional real Euclidean space
$\mathbb{C}^n$               n-dimensional complex Euclidean space
$C[a, b]$                    space of real- or complex-valued continuous functions on the interval $[a, b]$
$C^m[a, b]$                  space of m-times continuously differentiable functions
$L^2[a, b]$                  space of real- or complex-valued square-integrable functions
$\{a_1,\dots,a_m\}$          set of m elements $a_1,\dots,a_m$
$U \times V$                 product $U \times V := \{(x,y) : x \in U, y \in V\}$ of two sets $U$ and $V$
$U \setminus V$              difference set $U \setminus V := \{x \in U : x \notin V\}$ for two sets $U$ and $V$
$\bar{U}$                    closure of a set $U$
$F : X \to Y$                a mapping with domain $X$ and range in $Y$


Vectors and Matrices


$x = (x_1,\dots,x_n)$                    row vector in $\mathbb{R}^n$ or $\mathbb{C}^n$ with components $x_1,\dots,x_n$
$x^T = (x_1,\dots,x_n)^T$                the transpose of $x$, i.e., a column vector
$x^* = (\bar{x}_1,\dots,\bar{x}_n)^T$    the adjoint of $x$
$A = (a_{jk})$                           $m \times n$ matrix with elements $a_{jk}$
$A^T$                                    the transpose of $A$
$A^*$                                    the adjoint of $A$
$A^\dagger$                              the pseudo-inverse of $A$
$A^{-1}$                                 the inverse of an $n \times n$ matrix $A$
$\det A$                                 the determinant of an $n \times n$ matrix $A$
$\mathrm{cond}(A)$                       the condition number of an $n \times n$ matrix $A$
$\rho(A)$                                the spectral radius of an $n \times n$ matrix $A$
$I$                                      the $n \times n$ identity matrix
$\mathrm{diag}(a_1,\dots,a_n)$           diagonal matrix with diagonal elements $a_1,\dots,a_n$


Norms


$\|\cdot\|$                  norm on a linear space
$\|\cdot\|_1$                $\ell_1$ norm of a vector, $L_1$ norm of a function
$\|\cdot\|_2$                $\ell_2$ norm of a vector, $L_2$ norm of a function
$\|\cdot\|_\infty$           maximum norm of a vector or a function
$(\cdot\,,\cdot)$            scalar product on a linear space


Miscellaneous


$\in$                        element inclusion
$\subset$                    set inclusion
$\cup$, $\cap$               union and intersection of sets
$\emptyset$                  empty set
$O(m)$                       a quantity of order $m$
$\Box$                       end of proof
Introduction


Numerical analysis is concerned with the development and investigation of 
constructive methods for the numerical solution of mathematical problems. 
This objective differs from a pure-mathematical approach as illustrated by 
the following three examples. 

By the fundamental theorem of algebra, a polynomial of degree n has 
n complex zeros. The various proofs of this result, in general, are noncon- 
structive and give no procedure for the explicit computation of these zeros. 
Numerical analysis provides constructive methods for the actual computa- 
tion of the zeros of a polynomial. 

The solution of a system of n linear equations for n unknowns can be 
given explicitly by Cramer’s rule. However, Cramer’s rule is only of the- 
oretical importance, since for actual computations it is completely useless 
for linear systems with more than three unknowns. An important task 
in numerical analysis consists in describing and developing more practical 
methods for the solution of systems of linear equations. 

By the Picard–Lindelöf theorem, the initial value problem for an ordinary
differential equation has a unique solution (under appropriate regularity
assumptions). Despite the fact that the existence proof in the Picard–Lindelöf
theorem actually is constructive through the use of successive iterations, in
applied mathematics there is need for more effective procedures to
numerically solve the initial value problem.

In general, we may say that for the basic problems in numerical analysis 
existence and uniqueness of a solution are guaranteed through the results 
of pure mathematics. The main topic of numerical analysis is to provide 
efficient numerical methods for the actual computation of the solution. In 
some cases these numerical methods are actually based on constructive
existence proofs. 

By a constructive method we understand a procedure that for any pre- 
scribed accuracy determines an approximate solution by a finite number 
of computational steps. In general, the number of computational steps of 
course will depend on the required accuracy. Only very few methods will 
terminate with the exact solution after finitely many computational steps 
as, for example, Gaussian elimination for solving a system of linear equa- 
tions. In most cases, the numerical methods will only yield approximations 
to the exact solution. As a typical example, the numerical evaluation of 
a definite integral by the trapezoidal rule will, in general, provide only 
an approximate value for the integral. In this context two main questions 
arise, namely the question of estimating the error between the exact and 
the approximate solution and the question of numerical stability. 
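
To make the dependence of the accuracy on the number of computational steps concrete, here is a minimal sketch of the composite trapezoidal rule applied to the integral of exp over [0, 1]. The function name `trapezoid` and the choice of integrand are illustrative assumptions, not taken from the text.

```python
import math

def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals on [a, b]."""
    h = (b - a) / n
    total = 0.5 * (f(a) + f(b))
    for j in range(1, n):
        total += f(a + j * h)
    return h * total

exact = math.e - 1.0  # exact value of the integral of exp over [0, 1]
steps = (10, 20, 40)
errors = [abs(trapezoid(math.exp, 0.0, 1.0, n) - exact) for n in steps]
for n, err in zip(steps, errors):
    print(n, err)
```

For this smooth integrand the error shrinks by roughly a factor of four each time the number of subintervals is doubled, so any prescribed accuracy can be reached with finitely many steps, but the exact value is never produced.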

A numerical method is useful only if it is possible to decide on the accu- 
racy of the approximate solution, i.e., if reliable estimates on the difference 
between the exact and approximate solution can be given. Therefore, be- 
sides the development and design of numerical schemes, a substantial part 
of numerical analysis is concerned with the investigation and estimation of 
the errors occurring in these schemes. Here one has to discriminate between 
the approximation errors, i.e., the errors that arise through replacing the 
original problem by an approximate problem, and the roundoff errors, i.e., 
the errors that occur through the fact that in the actual computation, in 
general, real numbers are replaced by floating-point decimal numbers with 
a fixed number of digits. 
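
Roundoff errors can be observed directly in any floating-point arithmetic. The following small sketch uses Python's built-in floats, which are binary double-precision numbers rather than the decimal numbers mentioned above; this is an assumption of the illustration, but the effect is the same in kind.

```python
import sys

# 0.1, 0.2 and 0.3 have no exact binary representation, so the
# computed sum differs slightly from the stored value of 0.3.
s = 0.1 + 0.2
print(s == 0.3)        # False
print(abs(s - 0.3))    # tiny but nonzero

# Machine epsilon: the gap between 1.0 and the next larger float.
eps = sys.float_info.epsilon
print(1.0 + eps / 2 == 1.0)   # True: the increment is lost to rounding
```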

As far as stability is concerned, one has to distinguish between properly 
and improperly posed problems. A problem is called properly posed or 
well-posed if the solution depends continuously on the data, i.e., if small 
changes in the data cause only small changes in the solution. Otherwise, the 
problem is called improperly posed or ill-posed. Numerical approximations 
never can circumvent the improper posedness of a problem. However, it is 
desirable to control the effects of the ill-posed nature of a problem by an 
adequate choice of the numerical method. On the other hand, for properly 
posed problems efforts have to be made not to destroy the well-posedness 
by a poorly designed numerical approximation. 

To the author’s taste, the topic of stability and proper posedness is
more challenging from a mathematical perspective than the rather unin- 
spiring topic of roundoff errors. Therefore, in this book emphasis is given 
to ill-posedness and the related issue of ill-conditioning, whereas the dis- 
cussion of roundoff errors is given only cursory attention. 

The basic problems of numerical analysis are as old as mathematics it- 
self, and for a number of problems there exist classical approaches such as 
Newton’s method for the solution of nonlinear equations, Gaussian
elimination for the solution of systems of linear equations, Gauss–Seidel and
Jacobi iterations for linear systems, Lagrange interpolation for the
approximation of arbitrary functions by polynomials, Simpson’s rule for
numerical integration, and Euler’s method for the solution of initial value
problems. However, the main breakthrough of numerical methods is con- 
nected with the advances in computer technology made within the last 
four decades. Only the electronic computer allows one to perform exten- 
sive numerical computations without error and within a reasonable amount 
of time. Hence, progress in numerical analysis and computer science have 
always been closely interrelated in recent history. 

This book will introduce the reader to the following branches of numerical 
analysis: 

Solution of systems of linear and nonlinear equations, 

Numerical solution of matrix eigenvalue problems, 

Interpolation and numerical integration, 

Numerical solution of initial and boundary value problems for differential equations,

Numerical solution of integral equations.

Of course, in an introductory exposition of only about three hundred pages 
it is impossible to cover all of these areas exhaustively. Therefore, the reader 
should not expect a comprehensive treatment of all existing numerical pro- 
cedures. As already pointed out in the preface, our goal will be to guide 
the reader toward the basic ideas and questions in each of the above top- 
ics with an emphasis on the analysis and the understanding of numerical 
methods rather than merely their description. In order to achieve this, 
we will try to illustrate general principles by way of considering the main 
and most important methods, and we will leave aside discussions of more 
elaborate details of advanced methods and the consideration of lengthy 
subtleties for exceptional cases. Given the rapid development of numerical 
methods, a reasonable introduction to numerical analysis has to confine 
itself to presenting a solid foundation by restricting the presentation to the 
basic principles and procedures. 

The book includes a chapter on the necessary basic functional-analytic 
tools for the solid mathematical foundation of numerical analysis. These 
are indispensable for any deeper study and understanding of numerical 
methods, in particular for differential equations and integral equations. 

The limit of space and the taste and restrictions in experience of the 
author have caused the omission of some important topics such as linear 
and nonlinear optimization, approximation theory, and parallel computing, 
among others. On the other hand, with separate chapters on the solution 
of ill-conditioned systems of linear equations and the numerical solution 
of integral equations two topics are included that do not appear in most 
introductions to numerical analysis. They are included because of their im- 
portance and in order to indicate to the reader where the author’s mathe- 
matical research interests lie. 

A study of numerical analysis remains incomplete without the numer- 
ical experience of individually implementing the numerical algorithms. It 
is very important to build up a familiarity with numerical methods by
actually seeing the numbers working. For example, one has to complement
the theoretical understanding of the method of successive approximations 
by the experience of actually running the numerical schemes. After hav- 
ing understood the basic principles of a numerical method, it is important 
to develop the ability to actually implement the method numerically and 
work with it. In this sense the reader is encouraged to test on the computer 
numerically all of the algorithms presented in this book. 

The organization of the book is as follows. The first part of the book, 
Chapters 2 to 7, covers numerical linear algebra and is concerned with 
the solution of systems of linear and nonlinear equations. The necessary 
functional-analytic tools will be presented in Chapter 3. The second part 
of the book, Chapters 8 to 12, covers numerical analysis and is concerned 
with interpolation, numerical integration, and the numerical solution of dif- 
ferential and integral equations. At the reader’s convenience it is possible to 
study most of the second part of the book before reading the first part, with 
the exception of the chapter on functional analysis. Each chapter concludes 
with a set of problems. These are intended as exercises and applications of 
the material given in the chapter. 

The references at the end of the book are intended as a possible guide to 
some of the literature covering the topics of the individual chapters more 
exhaustively. The list of references is not meant as a bibliography on the 
vast number of introductions to numerical analysis competing with this 
book. However, we explicitly encourage the reader to explore the libraries 
and consult some of the other volumes on numerical analysis in order to 
develop a broad perspective. 
Linear Systems


The solution of systems of linear equations arises in various parts of mathe- 
matics and is of central importance in numerical analysis. To illustrate the 
significance of linear systems, we will start this chapter by providing some 
examples of their occurrence as part of the numerical solution of differential 
and integral equations. After seeing the examples, we will proceed with the 
solution of systems of linear equations. In principle, we have to distinguish 
between two groups of methods for the solution of linear systems: 

1. In the so-called direct methods, or elimination methods, the exact
solution, in principle, is determined through a finite number of arithmetic
operations (in real arithmetic leaving aside the influence of roundoff
errors).

2. In contrast to this, iterative methods generate a sequence of approx- 
imations to the solution by repeating the application of the same 
computational procedure at each step of the iteration. Usually, they 
are applied for large systems with special structures that ensure con- 
vergence of the successive approximations. 

A key consideration for the selection of a solution method for a linear 
system is its structure. In some problems, the matrix of the linear system 
may be a full matrix, i.e., it has few zero entries. And in other problems, 
the matrix may be very large and sparse, i.e., only a small fraction of the 
entries are different from zero. Roughly speaking, direct methods are best 
for full matrices, whereas iterative methods are best for very large and 
sparse matrices. 
We will begin our treatment of linear systems by presenting the
best-known and most widely used direct method, which is attributed to Gauss,
since it is based on considerations published by Gauss in 1801 in his Dis- 
quisitiones Arithmeticae. The chapter concludes with a brief description of 
elimination by orthonormal decomposition. 

In this book, for an $m \times n$ matrix $A = (a_{jk})$, $j = 1,\dots,m$, $k = 1,\dots,n$,
with real or complex coefficients, $A^T$ shall always denote the transposed
matrix; i.e., $A^T$ is the $n \times m$ matrix with entries

$$a^T_{kj} = a_{jk}, \quad k = 1,\dots,n, \quad j = 1,\dots,m.$$

By $A^*$ we denote the adjoint of the matrix $A$; i.e., $A^* = \bar{A}^T$ is the transpose
of the matrix with complex conjugate entries. In particular, the transpose
and adjoint of a row vector are column vectors and vice versa.
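
As a small illustration of these two definitions, the following sketch stores a matrix as a list of rows. The helper names `transpose` and `adjoint` are assumptions made for this example.

```python
def transpose(A):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]

def adjoint(A):
    """Return the adjoint: the transpose with complex conjugate entries."""
    return [[complex(a).conjugate() for a in col] for col in zip(*A)]

# A 3 x 2 matrix with complex entries
A = [[1 + 2j, 3],
     [0, 4 - 1j],
     [5j, 6]]

print(transpose(A))   # 2 x 3 matrix
print(adjoint(A))     # 2 x 3 matrix with conjugated entries
```

For a real matrix the two operations coincide, and applying either one twice returns the original matrix.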


2.1 Examples for Systems of Equations 


Example 2.1 We consider the discretization of the boundary value
problem for the ordinary differential equation

$$-u''(x) = f(x, u(x)), \quad x \in [0,1], \tag{2.1}$$

with boundary condition

$$u(0) = u(1) = 0. \tag{2.2}$$

Here, $f : [0,1] \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking
for a twice continuously differentiable solution $u : [0,1] \to \mathbb{R}$. Boundary
value problems of this type occur, for example, in the mathematical treat- 
ment of vibrations of a string or a rod and in the solution of heat conduction 
problems. They often also arise in the solution of problems like the following 
Example 2.2 after applying separation of variables. The theory of ordinary 
differential equations (see [12]) provides conditions on the right-hand side f 
of (2.1), ensuring existence and uniqueness of a solution u to the boundary 
value problem (2.1)—(2.2) (for the case of linear differential equations see 
also Chapter 11). 

For the approximate solution we choose an equidistant subdivision of the 
interval [0, 1] by setting 


$$x_j = jh, \quad j = 0,\dots,n+1,$$

where the step size is given by $h = 1/(n+1)$ with $n \in \mathbb{N}$. At the internal
grid points $x_j$, $j = 1,\dots,n$, we replace the differential quotient in the
differential equation (2.1) by the difference quotient

$$u''(x_j) \approx \frac{1}{h^2}\,[u(x_{j+1}) - 2u(x_j) + u(x_{j-1})]$$
to obtain the system of equations

$$-\frac{1}{h^2}\,[u_{j-1} - 2u_j + u_{j+1}] = f(x_j, u_j), \quad j = 1,\dots,n,$$

for approximate values $u_j$ to the exact solution $u(x_j)$. This system has to
be complemented by the two boundary conditions $u_0 = u_{n+1} = 0$. For an
abbreviated notation we introduce the $n \times n$ matrix

$$A = \frac{1}{h^2}
\begin{pmatrix}
2 & -1 & & & & \\
-1 & 2 & -1 & & & \\
 & -1 & 2 & -1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -1 & 2 & -1 \\
 & & & & -1 & 2
\end{pmatrix}$$

and the vectors $U = (u_1,\dots,u_n)^T$ and $F(U) = (f(x_1,u_1),\dots,f(x_n,u_n))^T$.
Then our system of equations, including the boundary conditions, reads

$$AU = F(U). \tag{2.3}$$


For obvious reasons, the above matrix $A$ is called a tridiagonal matrix, and
the vector $F$ is diagonal; i.e., the $j$th component of $F$ depends only on
the $j$th component of $U$. If (2.1) is a linear differential equation, i.e., if $f$
depends linearly on the second variable u, then the tridiagonal system of 
equations (2.3) also is linear. 

The following two questions will be addressed later in the book (see 
Chapter 11): 

1. Can we establish existence and uniqueness of a solution to the system 
of equations (2.3) for sufficiently small step size h, provided that the 
boundary value problem (2.1)—(2.2) itself is uniquely solvable? 

2. How large is the error between the approximate solution $u_j$ and the
exact solution $u(x_j)$? Do we have convergence of the approximate
solution towards the exact solution as $h \to 0$?

At this point we would like only to point out that the discretization of 
boundary value problems for ordinary differential equations leads to sys- 
tems of equations with a large number of unknowns, since we expect that 
in order to achieve a reasonably accurate approximation we need to choose 
the step size h sufficiently small. $\Box$
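
The assembly of the tridiagonal system (2.3) can be sketched in a few lines. The example below is an illustration under stated assumptions: it takes the particular right-hand side f(x, u) = 2, for which the exact solution of (2.1)–(2.2) is u(x) = x(1 - x), and the helper names `assemble_tridiagonal` and `matvec` are hypothetical. Since the difference quotient is exact for quadratic polynomials, the grid values of the exact solution should satisfy AU = F up to roundoff.

```python
def assemble_tridiagonal(n):
    """Return the n x n matrix A = (1/h^2) tridiag(-1, 2, -1) with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    A = [[0.0] * n for _ in range(n)]
    for j in range(n):
        A[j][j] = 2.0 / h**2
        if j > 0:
            A[j][j - 1] = -1.0 / h**2
        if j < n - 1:
            A[j][j + 1] = -1.0 / h**2
    return A

def matvec(A, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * x for a, x in zip(row, v)) for row in A]

n = 9
h = 1.0 / (n + 1)
x = [(j + 1) * h for j in range(n)]     # interior grid points x_1, ..., x_n
U = [xj * (1.0 - xj) for xj in x]       # exact solution of -u'' = 2, u(0) = u(1) = 0
F = [2.0] * n                           # f(x, u) = 2 at every grid point
A = assemble_tridiagonal(n)
residual = max(abs(r - f) for r, f in zip(matvec(A, U), F))
print(residual)   # of the order of roundoff
```

A dense list of rows is used only for readability; in practice one would store just the three diagonals.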


Example 2.2 We now consider the discretization of the boundary value 
problem for the elliptic partial differential equation 


$$-\Delta u(x) = f(x, u(x)), \quad x \in D, \tag{2.4}$$

with Dirichlet boundary condition

$$u(x) = 0, \quad x \in \partial D. \tag{2.5}$$

Here, $D \subset \mathbb{R}^2$ is a bounded domain, $\Delta$ denotes the Laplacian

$$\Delta u := \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2},$$

$f : D \times \mathbb{R} \to \mathbb{R}$ is a given continuous function, and we are looking for
a solution $u : \bar{D} \to \mathbb{R}$ that is continuous in $\bar{D}$ and twice continuously
differentiable in $D$. Boundary value problems of this type arise, for example,
in potential theory and in heat conduction problems. The theory of elliptic
partial differential equations (see [24]) provides conditions on the given
function $f$ that ensure existence and uniqueness of a solution $u$.

For describing a numerical approximation method we restrict ourselves 
to the case of the square $D = (0,1) \times (0,1)$. We choose an equidistant
quadratic grid with grid points

$$x_{ij} = (ih, jh), \quad i,j = 0,\dots,n+1,$$

where the step size again is given by $h = 1/(n+1)$ with $n \in \mathbb{N}$. Analogously
to the previous example, at the internal grid points $x_{ij}$, $i,j = 1,\dots,n$, we
replace the Laplacian by the Laplace difference operator

$$\Delta u(x_{ij}) \approx \frac{1}{h^2}\,[u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij})].$$


Obviously, for each point $x_{ij}$, this difference operator has nonvanishing
weights only at the four neighboring points on the vertical and horizontal
line through $x_{ij}$. This observation also illustrates why the set of grid points
with nonvanishing weights is called the star associated with the Laplace
difference operator. Using this difference approximation leads to the system
of equations

$$\frac{1}{h^2}\,[4u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1}] = f(x_{ij}, u_{ij}), \quad i,j = 1,\dots,n,$$

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. This system has to
be complemented by the boundary conditions

$$u_{0,j} = u_{n+1,j} = 0, \quad j = 0,\dots,n+1,$$

at the grid points on the vertical parts and

$$u_{i,0} = u_{i,n+1} = 0, \quad i = 1,\dots,n,$$


at the grid points on the horizontal parts of the boundary $\partial D$. In order to
write this system in matrix form we rearrange the unknowns by ordering
them row by row and setting

$$u_1 = u_{11},\ u_2 = u_{21},\ \dots,\ u_n = u_{n1},\ u_{n+1} = u_{12},\ \dots,\ u_m = u_{nn}$$

where $m = n^2$. Furthermore, we introduce an $m \times m$ matrix $A$ in the form
of an $n \times n$ block tridiagonal matrix

$$A = \frac{1}{h^2}
\begin{pmatrix}
B & -I & & & & \\
-I & B & -I & & & \\
 & -I & B & -I & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & -I & B & -I \\
 & & & & -I & B
\end{pmatrix}$$

where $I$ denotes the $n \times n$ identity matrix and $B$ is the $n \times n$ tridiagonal
matrix

$$B =
\begin{pmatrix}
4 & -1 & & & \\
-1 & 4 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 4 & -1 \\
 & & & -1 & 4
\end{pmatrix}.$$

After introducing the vectors $U$ and $F(U)$ analogously to Example 2.1, we
can rewrite the system of equations in the short form

$$AU = F(U), \tag{2.6}$$


which also includes the boundary conditions. 

Again we postpone the questions of unique solvability of the system (2.6) 
and the problem of convergence and error estimates for later parts of the 
book (see Chapter 11). Here, we conclude the example with the observation 
that the system has $n^2$ unknowns, where $n$ will be fairly large if the step
size h is sufficiently small in order to achieve a reasonably accurate approxi- 
mation to the solution of the boundary value problem. These large systems 
of equations arising in the discretization of partial differential equations 
call for efficient solution methods. $\Box$
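
The block tridiagonal structure and the sparsity of the matrix in (2.6) can be checked with a short sketch. The function name `assemble_laplace_2d` and the dense list-of-rows storage are assumptions made for illustration; the ordering of the unknowns follows the row-by-row numbering described above.

```python
def assemble_laplace_2d(n):
    """Return the n^2 x n^2 matrix of the five-point Laplace difference operator."""
    h = 1.0 / (n + 1)
    m = n * n
    A = [[0.0] * m for _ in range(m)]

    def index(i, j):
        # Row-by-row ordering u_11, u_21, ..., u_n1, u_12, ... (zero-based result)
        return (j - 1) * n + (i - 1)

    for j in range(1, n + 1):
        for i in range(1, n + 1):
            row = index(i, j)
            A[row][row] = 4.0 / h**2
            for p, q in ((i + 1, j), (i - 1, j), (i, j + 1), (i, j - 1)):
                # Neighbors on the boundary carry the value zero and drop out.
                if 1 <= p <= n and 1 <= q <= n:
                    A[row][index(p, q)] = -1.0 / h**2
    return A

A = assemble_laplace_2d(3)
nonzeros = sum(1 for row in A for a in row if a != 0.0)
print(len(A), nonzeros)   # 9 unknowns, 33 nonzero entries out of 81
```

Each row has at most five nonzero entries regardless of n, so the fraction of nonzero entries shrinks rapidly as the grid is refined.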


Example 2.3 Consider the linear integral equation 


$$\varphi(x) - \int_0^1 K(x,y)\varphi(y)\,dy = f(x), \quad x \in [0,1],$$

where $K : [0,1] \times [0,1] \to \mathbb{R}$ and $f : [0,1] \to \mathbb{R}$ are given continuous
functions and where we seek a continuous solution $\varphi : [0,1] \to \mathbb{R}$. Such integral
equations either arise directly in the solution of applied problems, or more
often they occur indirectly in the solution of boundary value problems for
differential equations. If the homogeneous form of this equation, i.e., the
integral equation with the right-hand side $f = 0$, admits only the trivial
solution $\varphi = 0$, then for each $f$ the inhomogeneous integral equation has a
unique solution $\varphi$ (see Chapter 12).
For the numerical approximation we replace the integral by the
rectangular sum

$$\int_0^1 K(x,y)\varphi(y)\,dy \approx \frac{1}{n}\sum_{k=1}^n K(x, x_k)\varphi(x_k)$$

with equidistant grid points $x_k = k/n$, $k = 1,\dots,n$. If we require the
approximated equation to be satisfied only at the grid points, we arrive at
the system of linear equations

$$\varphi_j - \frac{1}{n}\sum_{k=1}^n K(x_j, x_k)\varphi_k = f(x_j), \quad j = 1,\dots,n,$$

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. As in the
preceding examples, we postpone the question of unique solvability of the linear
system and the convergence and error analysis (see Chapter 12). $\Box$
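
A minimal sketch of how the matrix and right-hand side of this linear system could be set up is given below. The kernel and right-hand side are made-up examples, and the name `assemble_integral_system` is hypothetical.

```python
import math

def assemble_integral_system(K, f, n):
    """Matrix and right-hand side of phi_j - (1/n) sum_k K(x_j, x_k) phi_k = f(x_j)."""
    x = [k / n for k in range(1, n + 1)]     # grid points x_k = k/n
    M = [[(1.0 if j == k else 0.0) - K(x[j], x[k]) / n for k in range(n)]
         for j in range(n)]
    rhs = [f(xj) for xj in x]
    return M, rhs

def K(x, y):
    # Hypothetical continuous kernel, chosen only for illustration
    return math.exp(-abs(x - y))

def f(x):
    # Hypothetical right-hand side
    return 1.0

M, rhs = assemble_integral_system(K, f, 4)
for row in M:
    print(row)
```

In contrast to the two preceding examples, every entry of this matrix is in general different from zero, because the kernel couples each grid point to all others.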


Example 2.4 In this last example we will briefly touch on the method of 
least squares. Consider some (physical) quantity $u$ depending on time $t$ and
a parameter vector $a = (a_1,\dots,a_n)^T \in \mathbb{R}^n$ in terms of a known function

$$u(t) = f(t; a).$$

In order to determine the values of the parameter $a$ (representing some
physical constants), one can take $m$ measurements of $u$ at different times
$t_1,\dots,t_m$ and then try to find $a$ by solving the system of equations

$$u(t_j) = f(t_j; a), \quad j = 1,\dots,m.$$

If $m = n$, this system consists of $n$ equations for the $n$ unknowns $a_1,\dots,a_n$.
However, in general, the measurements will be contaminated by errors.
Therefore, usually one will take $m > n$ measurements and then will try to
determine $a$ by requiring the deviations

$$u(t_j) - f(t_j; a), \quad j = 1,\dots,m,$$

to be as small as possible. Usually the latter requirement is posed in the
least squares sense, i.e., the parameter $a$ is chosen such that

$$g(a) := \sum_{k=1}^m [u(t_k) - f(t_k; a)]^2$$

attains a minimal value. The necessary conditions for a minimum,

$$\frac{\partial g}{\partial a_j} = 0, \quad j = 1,\dots,n,$$

lead to the normal equations

$$\sum_{k=1}^m [u(t_k) - f(t_k; a)]\,\frac{\partial f(t_k; a)}{\partial a_j} = 0, \quad j = 1,\dots,n,$$

for the method of least squares. These constitute a system of $n$, in general,
nonlinear equations for the $n$ unknowns $a_1,\dots,a_n$. $\Box$
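
For a model that is linear in the parameters the normal equations become a linear system. As a hedged illustration, the sketch below assumes the straight-line model f(t; a) = a_1 + a_2 t, for which the two normal equations read m a_1 + (sum of t_k) a_2 = sum of u(t_k) and (sum of t_k) a_1 + (sum of t_k^2) a_2 = sum of t_k u(t_k). The function name `fit_line` and the measurement values are made up for this example.

```python
def fit_line(t, u):
    """Least squares fit of u ~ a1 + a2*t by solving the 2 x 2 normal equations."""
    m = len(t)
    st = sum(t)
    stt = sum(tk * tk for tk in t)
    su = sum(u)
    stu = sum(tk * uk for tk, uk in zip(t, u))
    det = m * stt - st * st      # nonzero if at least two of the times differ
    a1 = (su * stt - st * stu) / det
    a2 = (m * stu - st * su) / det
    return a1, a2

t = [0.0, 1.0, 2.0, 3.0]
u = [1.1, 2.9, 5.2, 6.8]         # made-up measurements scattered around 1 + 2t
a1, a2 = fit_line(t, u)
print(a1, a2)
```

With m = 4 measurements and n = 2 parameters the system u(t_j) = f(t_j; a) is overdetermined, and the fitted parameters only approximately reproduce each measurement.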


At this point, the reader should be convinced of the need for effective 
methods for solving large systems of linear and nonlinear equations and be 
willing to be introduced to such methods in the subsequent chapters. We 
also wish to note that the discretization of differential equations leads to 
sparse matrices, whereas for the least squares problem and the discretiza- 
tion of integral equations one is faced with full matrices. 


2.2 Gaussian Elimination 


We proceed with describing the Gaussian elimination method for a system 
of linear equations 
$$Ax = y.$$

Here $A$ is a given $n \times n$ matrix $A = (a_{jk})$ with real (or complex) entries, $y$
a given right-hand side $y = (y_1,\dots,y_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$), and we are looking
for a solution vector $x = (x_1,\dots,x_n)^T \in \mathbb{R}^n$ (or $\mathbb{C}^n$). More explicitly, our
system of equations can be written in the form

$$\sum_{k=1}^n a_{jk} x_k = y_j, \quad j = 1,\dots,n;$$

that is,

$$\begin{aligned}
a_{11}x_1 + a_{12}x_2 + \cdots + a_{1n}x_n &= y_1 \\
a_{21}x_1 + a_{22}x_2 + \cdots + a_{2n}x_n &= y_2 \\
&\ \ \vdots \\
a_{n1}x_1 + a_{n2}x_2 + \cdots + a_{nn}x_n &= y_n.
\end{aligned}$$


Assuming that the reader is familiar with basic linear algebra, we recall the 
following various ways of saying that the matrix A is nonsingular: 
1. The inverse matrix $A^{-1}$ exists.
2. For each $y$ the linear system $Ax = y$ has a unique solution.
3. The homogeneous system $Ax = 0$ has only the trivial solution.
4. The determinant of $A$ satisfies $\det A \neq 0$.
5. The rows (columns) of $A$ are linearly independent.



The very basic idea of the Gaussian elimination method is to use the first
equation to eliminate the first unknown from the last $n - 1$ equations, then
use the new second equation to eliminate the second unknown from the last
$n - 2$ equations, etc. This way, by $n - 1$ such eliminations the given linear
system is transformed into an equivalent linear system that is of triangular
form

$$\begin{aligned}
b_{11}x_1 + b_{12}x_2 + \cdots + b_{1n}x_n &= z_1 \\
b_{22}x_2 + \cdots + b_{2n}x_n &= z_2 \\
&\ \ \vdots \\
b_{n-1,n-1}x_{n-1} + b_{n-1,n}x_n &= z_{n-1} \\
b_{nn}x_n &= z_n.
\end{aligned}$$


Recall that two linear systems are called equivalent if every solution of one
is a solution of the other. The triangular system can be solved recursively
by first obtaining $x_n$ from the last equation, then obtaining $x_{n-1}$ from the
second to last equation, etc. This procedure is known as backward
substitution. Explicitly, it is described by $x_n = z_n / b_{nn}$ and

$$x_m = \frac{1}{b_{mm}}\Big(z_m - \sum_{k=m+1}^n b_{mk} x_k\Big), \quad m = n-1, n-2, \dots, 1.$$
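
The backward substitution formula translates almost literally into code. The following is a minimal sketch with zero-based indices; the function name `back_substitution` and the sample system are illustrative assumptions.

```python
def back_substitution(B, z):
    """Solve the upper triangular system B x = z, assuming all B[m][m] are nonzero."""
    n = len(z)
    x = [0.0] * n
    for m in range(n - 1, -1, -1):       # corresponds to m = n, n-1, ..., 1
        s = sum(B[m][k] * x[k] for k in range(m + 1, n))
        x[m] = (z[m] - s) / B[m][m]
    return x

B = [[2.0, 1.0, -1.0],
     [0.0, 3.0, 2.0],
     [0.0, 0.0, 4.0]]
z = [1.0, 12.0, 12.0]
x = back_substitution(B, z)
print(x)   # [1.0, 2.0, 3.0]
```

For m = n the sum is empty, so the first pass reproduces the separate formula for the last unknown.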


We begin by considering a nonsingular matrix $A$. To eliminate the
unknown $x_1$, for $j = 2,\dots,n$ we multiply the first equation by $a_{j1}/a_{11}$ and
subtract the result from the $j$th equation. For this we have to require that
$a_{11} \neq 0$. Since we assume the matrix to be nonsingular, this can be achieved
by reordering the rows or the columns of the given system. This procedure
leads to a system of the form

$$\begin{aligned}
b_{11}x_1 + b_{12}x_2 + \cdots + b_{1n}x_n &= z_1 \\
a^{(2)}_{22}x_2 + \cdots + a^{(2)}_{2n}x_n &= y^{(2)}_2 \\
&\ \ \vdots \\
a^{(2)}_{n2}x_2 + \cdots + a^{(2)}_{nn}x_n &= y^{(2)}_n
\end{aligned}$$

with the new coefficients given by

$$b_{1k} := a^{(1)}_{1k}, \quad k = 1,\dots,n, \qquad
a^{(2)}_{jk} := a^{(1)}_{jk} - \frac{a^{(1)}_{j1}\,a^{(1)}_{1k}}{a^{(1)}_{11}}, \quad j,k = 2,\dots,n,$$

and the new right-hand sides given by

$$z_1 := y^{(1)}_1, \qquad
y^{(2)}_j := y^{(1)}_j - \frac{a^{(1)}_{j1}\,y^{(1)}_1}{a^{(1)}_{11}}, \quad j = 2,\dots,n.$$

Here, for the coefficients and right-hand sides of the original system we
have set $a^{(1)}_{jk} := a_{jk}$ and $y^{(1)}_j := y_j$.

Proceeding in this way, the given $n \times n$ system for the unknowns $x_1,\dots,x_n$
is equivalently transformed into an $(n-1) \times (n-1)$ system for the unknowns
$x_2,\dots,x_n$. Adding a multiple of one row of a matrix to another row does
not change the value of its determinant. Therefore, in the above
elimination the determinant of the system remains the same (with the exception
of a possible change of its sign if the order of rows or columns is changed).
Hence, the resulting $(n-1) \times (n-1)$ system for $x_2,\dots,x_n$ again has a
nonvanishing determinant, and we can apply precisely the same procedure
to eliminate the second unknown $x_2$ from the remaining $(n-1) \times (n-1)$
system.

By repeating this process we complete the forward elimination, by which 
the system of linear equations 


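$$\begin{aligned}
a^{(1)}_{11}x_1 + a^{(1)}_{12}x_2 + \cdots + a^{(1)}_{1n}x_n &= y^{(1)}_1 \\
a^{(1)}_{21}x_1 + a^{(1)}_{22}x_2 + \cdots + a^{(1)}_{2n}x_n &= y^{(1)}_2 \\
&\ \ \vdots \\
a^{(1)}_{n1}x_1 + a^{(1)}_{n2}x_2 + \cdots + a^{(1)}_{nn}x_n &= y^{(1)}_n
\end{aligned}$$

with a nonsingular matrix $A = (a^{(1)}_{jk})$ is equivalently transformed into a
triangular system

$$\begin{aligned}
b_{11}x_1 + b_{12}x_2 + \cdots + b_{1n}x_n &= z_1 \\
b_{22}x_2 + \cdots + b_{2n}x_n &= z_2 \\
&\ \ \vdots \\
b_{n-1,n-1}x_{n-1} + b_{n-1,n}x_n &= z_{n-1} \\
b_{nn}x_n &= z_n
\end{aligned}$$

by $n - 1$ recursive elimination steps of the form

$$a^{(m+1)}_{jk} := a^{(m)}_{jk} - \frac{a^{(m)}_{jm}\,a^{(m)}_{mk}}{a^{(m)}_{mm}}, \quad j,k = m+1,\dots,n,$$

$$y^{(m+1)}_j := y^{(m)}_j - \frac{a^{(m)}_{jm}\,y^{(m)}_m}{a^{(m)}_{mm}}, \quad j = m+1,\dots,n,$$

for $m = 1,\dots,n-1$.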
The coefficients and the right-hand sides of the final triangular system are
given by

$$b_{jk} := a^{(j)}_{jk}, \quad k = j,\dots,n, \quad j = 1,\dots,n,$$

and

$$z_j := y^{(j)}_j, \quad j = 1,\dots,n.$$

The condition $a^{(m)}_{mm} \neq 0$, which is necessary for performing the algorithm,
always can be achieved by a reordering of the rows or columns, since
otherwise the matrix $A$ would not be nonsingular.

We would like to compress the operations of one elimination step into 
the following scheme 


where the rectangle illustrates the remaining part of the matrix and the 
right-hand side for which the elimination has to be performed. Here, a 
stands for the elimination element, or pivot element; the elements 6 in 
the elimination row remain unchanged; the elements c of the elimination 
column are replaced by zero (with the exception of the pivot element a); 
and the remaining elements d are changed according to the rule 


$$d \to d - \frac{bc}{a}.$$


We note that in computer calculations, of course, the new values for the 
coefficients of the matrix and the right-hand sides can be stored in the 
locations held by the old values. 

More explicitly, the entire Gaussian elimination can be written in the 
following algorithmic form. 


Algorithm 2.5 (Gaussian elimination) 


1. Forward elimination:

   For $m = 1,\dots,n-1$ do
      for $j = m+1,\dots,n$ do
         for $k = m+1,\dots,n$ do $a_{jk} := a_{jk} - \dfrac{a_{jm}a_{mk}}{a_{mm}}$
         $y_j := y_j - \dfrac{a_{jm}y_m}{a_{mm}}$

2. Backward substitution:

   For $m = n, n-1,\dots,1$ do
      $x_m := y_m$
      for $k = m+1,\dots,n$ do $x_m := x_m - a_{mk}x_k$
      $x_m := \dfrac{x_m}{a_{mm}}$
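As an illustration, Algorithm 2.5 can be sketched in Python. This is a minimal sketch under stated assumptions: the function name `gaussian_elimination` and the list-of-rows storage are choices made here and do not come from the text, no pivoting is performed (a zero pivot raises an error instead of triggering a reordering), and the quotient $a_{jm}/a_{mm}$ is computed once per row.

```python
def gaussian_elimination(a, y):
    """Solve a x = y following Algorithm 2.5 (no pivoting).

    a is a list of n rows, y a list of n right-hand side values.
    The inputs are copied, so the caller's data is not overwritten.
    """
    n = len(y)
    a = [row[:] for row in a]
    y = y[:]

    # 1. Forward elimination
    for m in range(n - 1):
        if a[m][m] == 0:
            raise ZeroDivisionError("zero pivot: rows or columns must be reordered")
        for j in range(m + 1, n):
            factor = a[j][m] / a[m][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[m][k]
            y[j] -= factor * y[m]

    # 2. Backward substitution
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        s = y[m]
        for k in range(m + 1, n):
            s -= a[m][k] * x[k]
        x[m] = s / a[m][m]
    return x
```

The entries below the diagonal are never set to zero explicitly, since the backward substitution only reads the upper triangle.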


If the matrix $A$ is singular and has rank $r$, the elimination procedure
will terminate after $r$ steps. The matrix of the remaining $(n-r) \times (n-r)$
system for the unknowns $x_{r+1},\dots,x_n$ is the zero matrix, because otherwise
the rank of $A$ would be different from $r$. Hence, in this case the given linear
system is solvable if and only if the right-hand sides after $r$ elimination
steps satisfy

$$z_{r+1} = \cdots = z_n = 0.$$

The solutions can be found from the triangular system by arbitrarily choosing
$x_{r+1},\dots,x_n$ and then recursively determining $x_r,\dots,x_1$. This way we
obtain the $(n-r)$-dimensional solution manifold.

In order to control the influence of roundoff errors we want to keep the
quotient $a^{(m)}_{jm}/a^{(m)}_{mm}$ small; i.e., we want to have a large pivot element $a^{(m)}_{mm}$.
Therefore, instead of only requiring $a^{(m)}_{mm} \neq 0$, in practice, either complete
pivoting or partial row or column pivoting is employed. For complete pivoting,
both the rows and the columns are reordered such that $a^{(m)}_{mm}$ has
maximal absolute value in the $(n-m+1) \times (n-m+1)$ matrix remaining
for the $m$th forward elimination step. In order to minimize the additional
computational cost caused by pivoting, for row (or column) pivoting the
rows (or columns) are reordered such that $a^{(m)}_{mm}$ has maximal absolute value
in the elimination column (or row), i.e., in the $m$th column (or row). Of
course, in the actual implementation of the Gaussian elimination algorithm
the reordering of rows and columns need not be done explicitly. Instead,
the interchange may be done only implicitly by leaving the pivot element
at its original location and keeping track of the interchange of rows and
columns through the associated permutation matrix.
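The implicit interchange described above can be sketched as follows. This is only an illustration under assumptions made here: the name `solve_implicit_row_pivoting` and the index list `perm` (which plays the role of the permutation matrix) are hypothetical, and only row pivoting is shown.

```python
def solve_implicit_row_pivoting(a, y):
    """Gaussian elimination with row pivoting; rows are never moved in storage."""
    n = len(y)
    a = [row[:] for row in a]
    y = y[:]
    perm = list(range(n))  # perm[m] = storage index of the m-th pivot row

    for m in range(n - 1):
        # among the remaining rows, pick the largest entry of column m
        p = max(range(m, n), key=lambda i: abs(a[perm[i]][m]))
        if a[perm[p]][m] == 0:
            raise ValueError("matrix is singular")
        perm[m], perm[p] = perm[p], perm[m]  # swap indices, not rows
        pm = perm[m]
        for i in range(m + 1, n):
            j = perm[i]
            factor = a[j][m] / a[pm][m]
            for k in range(m + 1, n):
                a[j][k] -= factor * a[pm][k]
            y[j] -= factor * y[pm]

    # backward substitution, reading the rows in pivot order
    x = [0.0] * n
    for m in range(n - 1, -1, -1):
        pm = perm[m]
        s = y[pm]
        for k in range(m + 1, n):
            s -= a[pm][k] * x[k]
        x[m] = s / a[pm][m]
    return x
```

Swapping two integers in `perm` replaces the copying of two full rows; the price is one extra level of indexing in the inner loops.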

The following example illustrates that partial pivoting does not always 
prevent loss of accuracy in the numerical computations. 


Example 2.6 We consider the system

$$\begin{aligned}
x_1 + 200x_2 &= 100 \\
x_1 + x_2 &= 1
\end{aligned}$$

with the exact solution $x_1 = 100/199 = 0.502\ldots$, $x_2 = 99/199 = 0.497\ldots$.
For the following computations we use two-decimal-digit floating-point
arithmetic. Partial pivoting within the first column, whose two entries both
equal one, leads to $a_{11}$ as pivot element, and the elimination yields

$$\begin{aligned}
x_1 + 200x_2 &= 100 \\
-200x_2 &= -99,
\end{aligned}$$

since $199 = 200$ in two-digit floating-point representation. From the second
equation we then have $x_2 = 0.50$ ($0.495 = 0.50$ in two decimal digits), and
from the first equation it finally follows that $x_1 = 0$.

However, if by complete pivoting we choose $a_{12}$ as pivot element, the
elimination leads to

$$\begin{aligned}
x_1 + 200x_2 &= 100 \\
x_1 &= 0.5
\end{aligned}$$

($0.995 = 1.00$ in two decimal digits), and from this we get the solution
$x_1 = 0.5$, $x_2 = 0.5$ ($0.4975 = 0.50$ in two decimal digits), which is correct
to two decimal digits. □
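The computations of Example 2.6 can be imitated with Python's `decimal` module. The sketch below rests on assumptions made here: two-digit arithmetic is modeled by a decimal context with precision 2 and round-half-up, every intermediate result is rounded, and the helper `solve_2x2_two_digits` is a hypothetical name. It always takes the pivot from the first row and lets the caller choose the pivot column.

```python
from decimal import Decimal, localcontext, ROUND_HALF_UP


def solve_2x2_two_digits(a, y, pivot_column):
    """Solve a 2x2 system with pivot a[0][pivot_column] in 2-digit arithmetic."""
    with localcontext() as ctx:
        ctx.prec = 2                  # two significant decimal digits
        ctx.rounding = ROUND_HALF_UP
        a = [[Decimal(v) for v in row] for row in a]
        y = [Decimal(v) for v in y]
        p, q = pivot_column, 1 - pivot_column
        factor = a[1][p] / a[0][p]
        d = a[1][q] - factor * a[0][q]   # remaining coefficient in row 2
        z = y[1] - factor * y[0]         # new right-hand side of row 2
        x = [None, None]
        x[q] = z / d
        x[p] = (y[0] - a[0][q] * x[q]) / a[0][p]
    return x


a = [[1, 200], [1, 1]]
y = [100, 1]
x_first_column = solve_2x2_two_digits(a, y, 0)   # pivot a_11
x_complete = solve_2x2_two_digits(a, y, 1)       # pivot a_12
```

With the pivot $a_{11}$ the sketch reproduces $x_1 = 0$, $x_2 = 0.50$, and with the pivot $a_{12}$ it reproduces $x_1 = x_2 = 0.5$, in agreement with the example.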


Since complete pivoting is more costly than partial pivoting, in practical
computations one can try to overcome the disadvantages of partial pivoting
by scaling the matrix. This means that if $B = D_1AD_2$, in order to obtain
the solution $x$ of $Ax = y$ we first solve $Bz = D_1y$ for $z$ and then determine
$x$ from $x = D_2z$. Here $D_1$ and $D_2$ are some diagonal matrices chosen such
that for the matrix $B$ the row and column sums of the absolute values are
approximately equal. A diagonal matrix $D = (d_{jk})$ is a matrix with the
off-diagonal elements equal to zero; i.e., $d_{jk} = 0$ for $j \neq k$. For a detailed
discussion of scaling we refer to [27]. Unfortunately, there is no known
general procedure for such scaling, i.e., for choosing the diagonal matrices
$D_1$ and $D_2$.

For an estimate of the computational cost of Gaussian elimination we
perform a count of the number of multiplications. By $a_n$ we denote the
number of multiplications that are required for solving a triangular $n \times n$
system by back substitution. Obviously, for $a_n$ we have the recurrence
relation

$$a_n = a_{n-1} + n,$$

since we need $n$ multiplications to obtain $x_1$ from the first equation after
having already determined $x_2,\dots,x_n$. Hence, we have

$$a_n = \sum_{k=1}^{n} k = \frac{n(n+1)}{2},$$

since $a_1 = 1$. By $\beta_{n,r}$ we denote the number of multiplications needed
for the forward elimination simultaneously for $r$ different right-hand sides.
Here we have the recurrence relation

$$\beta_{n,r} = \beta_{n-1,r} + (n+r)(n-1),$$

since the elimination of the unknown $x_1$ requires $n+r$ multiplications for
each row of the $n-1$ rows. From this it follows that

$$\beta_{n,r} = \sum_{k=1}^{n} (k+r)(k-1) = \frac{n^3-n}{3} + \frac{n(n-1)r}{2},$$

because $\beta_{1,r} = 0$. Adding $ra_n$ and $\beta_{n,r}$ we obtain the following result.


Theorem 2.7 Gaussian elimination for the simultaneous solution of an
$n \times n$ system for $r$ different right-hand sides requires a total of

$$\frac{n^3}{3} + rn^2 - \frac{n}{3}$$

multiplications.
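The count in Theorem 2.7 can be checked by brute force. The following sketch assumes, as the derivation does, that each division is counted as a multiplication and that the quotient $a_{jm}/a_{mm}$ is formed once per row; the function name `count_multiplications` is hypothetical.

```python
def count_multiplications(n, r):
    """Count multiplications (divisions included) for an n x n system
    with r right-hand sides: forward elimination plus back substitution."""
    count = 0
    # forward elimination, step m: each of the n-m rows below the pivot needs
    # 1 quotient, n-m products for the matrix and r for the right-hand sides
    for m in range(1, n):
        rows_below = n - m
        count += rows_below * (1 + (n - m) + r)
    # back substitution: k operations for the k-th unknown counted from the
    # bottom, repeated for every right-hand side
    for _ in range(r):
        for k in range(1, n + 1):
            count += k
    return count


def theorem_2_7(n, r):
    """Closed form n^3/3 + r n^2 - n/3, evaluated in integer arithmetic."""
    return (n**3 - n) // 3 + r * n**2
```

Since $n^3 - n = (n-1)n(n+1)$ is always divisible by three, the integer division in `theorem_2_7` is exact.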


The computational cost, counting only the multiplications, in Gaussian
elimination is $n^3/3 + O(n^2)$. It is left to the reader to show that the number
of additions is also $n^3/3 + O(n^2)$ (see Problem 2.7). Doubling the number
of unknowns increases the computation time by a factor of eight. Assuming
1 $\mu$sec $= 10^{-6}$ sec per addition and multiplication, i.e., on a computer with
one million floating point operations per second, the solution of a system
with $n = 10^3$ requires approximately ten minutes, and with $n = 10^4$ it
requires approximately six days. This illustrates dramatically that for the
solution of large linear systems iterative methods, which we will study in
Chapter 4, are better suited than direct methods. Row or column pivoting
leads to an additional cost proportional to $n^2$, whereas complete pivoting
adds costs proportional to $n^3$. For the latter reason, complete pivoting is
used only rarely in practical computations.

The Gaussian algorithm also allows the computation of the determinant 
and the inverse of a matrix A. The determinant det A is simply given by the 
product of the diagonal elements in the triangular matrix obtained through 
the elimination procedure. If the determinant is computed using expansions 
by submatrices, then the operational count is $n!$ multiplications, as compared
to $n^3/3$ for Gaussian elimination. This illustrates why Cramer’s rule
for the solution of linear systems is only a theoretical mathematical tool 
and not a tool for practical computations. 

The inverse of a matrix is obtained by solving the linear system simul- 
taneously for the n right-hand sides given by the columns of the identity 
matrix, i.e., by solving the n systems 


$$Ax_i = e_i, \quad i = 1,\dots,n,$$

where $e_i$ is the $i$th column of the identity matrix. Then the $n$ solutions
$x_1,\dots,x_n$ will provide the columns of the inverse matrix $A^{-1}$. We would
like to stress that one does not want to solve a system $Ax = y$ by first
computing $A^{-1}$ and then evaluating $x = A^{-1}y$, since this generally leads
to considerably higher computational costs.

The Gauss–Jordan method is an elimination algorithm that in each step
eliminates the unknown both above and below the diagonal. The complete
elimination procedure transforms the system equivalently into a diagonal
system. The multiplication count shows a computational cost of
order $n^3/2 + O(n^2)$, i.e., an increase of 50 percent over Gaussian elimination.
Hence, the Gauss–Jordan method is rarely used in applications. For
details we refer to [26, 27].


2.3 LR Decomposition 


In the sequel we will indicate how Gaussian elimination provides an LR 
decomposition (or factorization) of a given matrix. 


Definition 2.8 A factorization of a matrix $A$ into a product

$$A = LR$$

of a lower (left) triangular matrix $L$ and an upper (right) triangular matrix
$R$ is called an LR decomposition of $A$.


A matrix $A = (a_{jk})$ is called lower triangular or left triangular if $a_{jk} = 0$
for $j < k$; it is called upper triangular or right triangular if $a_{jk} = 0$ for
$j > k$. The product of two lower (upper) triangular matrices again is lower
(upper) triangular, lower (upper) triangular matrices with nonvanishing
diagonal elements are nonsingular, and the inverse matrix of a lower (upper)
triangular matrix again is lower (upper) triangular (see Problem 2.14).


Theorem 2.9 For a nonsingular matrix $A$, Gaussian elimination (without
reordering rows and columns) yields an LR decomposition.


Proof. In the first elimination step we multiply the first equation by $a_{j1}/a_{11}$
and subtract the result from the $j$th equation; i.e., the matrix $A_1 = A$ is
multiplied from the left by the lower triangular matrix

$$L_1 = \begin{pmatrix}
1 & & & \\
-\dfrac{a_{21}}{a_{11}} & 1 & & \\
\vdots & & \ddots & \\
-\dfrac{a_{n1}}{a_{11}} & & & 1
\end{pmatrix}.$$

The resulting matrix $A_2 = L_1A_1$ is of the form

$$A_2 = \begin{pmatrix} a_{11} & * \\ 0 & A_{n-1} \end{pmatrix},$$

where $A_{n-1}$ is an $(n-1) \times (n-1)$ matrix. In the second step the same procedure
is repeated for the $(n-1) \times (n-1)$ matrix $A_{n-1}$. The corresponding
$(n-1) \times (n-1)$ elimination matrix is completed as an $n \times n$ triangular
matrix $L_2$ by setting the diagonal element in the first row equal to one. In
this way, $n-1$ elimination steps lead to

$$L_{n-1} \cdots L_1 A = R,$$

with nonsingular lower triangular matrices $L_1,\dots,L_{n-1}$ and an upper triangular
matrix $R$. From this we find

$$A = LR,$$

where $L$ denotes the inverse of the product $L_{n-1} \cdots L_1$. □


We wish to point out that not every nonsingular matrix allows an LR
decomposition. For example,

$$\begin{pmatrix} 0 & 1 \\ 1 & 0 \end{pmatrix}$$

has no LR decomposition. However, since Gaussian elimination with row
reordering always works, for each nonsingular matrix $A$ there exists a permutation
matrix $P$ such that $PA$ has an LR decomposition (see Problem
2.16). A permutation matrix is a matrix of the form $P = (e_{p(1)},\dots,e_{p(n)})$,
where $e_1,\dots,e_n$ are the columns of the identity matrix and $p(1),\dots,p(n)$
is a permutation of $1,\dots,n$.
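The construction in the proof of Theorem 2.9 can be sketched in Python. Two points are assumptions made here and are not spelled out in the text: the function name `lr_decomposition` is hypothetical, and the sketch uses the standard fact that the inverse of $L_{n-1} \cdots L_1$ simply collects the quotients $a^{(m)}_{jm}/a^{(m)}_{mm}$ below a unit diagonal.

```python
def lr_decomposition(a):
    """Return (L, R) with A = L R, L unit lower triangular, R upper triangular.

    No rows or columns are reordered, so a zero pivot raises an error.
    """
    n = len(a)
    r = [[float(v) for v in row] for row in a]
    l = [[1.0 if j == k else 0.0 for k in range(n)] for j in range(n)]
    for m in range(n - 1):
        if r[m][m] == 0.0:
            raise ZeroDivisionError("zero pivot: no LR decomposition without reordering")
        for j in range(m + 1, n):
            l[j][m] = r[j][m] / r[m][m]   # quotient of the elimination step
            r[j][m] = 0.0                 # eliminated entry
            for k in range(m + 1, n):
                r[j][k] -= l[j][m] * r[m][k]
    return l, r
```

Applied to the matrix with rows $(0, 1)$ and $(1, 0)$ the sketch stops at the first pivot, in line with the remark that this matrix has no LR decomposition.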


Recall that an $n \times n$ matrix $A$ is called symmetric if it has real coefficients
and $A = A^T$. A symmetric matrix $A$ is called positive definite if $x^TAx > 0$
for all $x \in \mathbb{R}^n$ with $x \neq 0$. Positive definite matrices have positive diagonal
elements (see Problem 2.10), and therefore a reordering of rows and columns
is not necessary for Gaussian elimination (for pivoting, the largest diagonal
element is chosen). It can be shown (see Problem 2.13) that symmetry and
positive definiteness are preserved throughout the elimination if diagonal
elements are taken as pivot elements. Therefore, for symmetric positive
definite matrices the LR decomposition is always possible. If $A = LR$, then
we have also $A = A^T = R^TL^T$, and from Problem 2.15 we can deduce that
$L$ can be normalized such that $A = LL^T$. Such a decomposition is used
in the Cholesky method for the solution of linear systems with symmetric
positive definite matrices. Because of symmetry, the computational cost
for the Cholesky method is $n^3/6 + O(n^2)$ multiplications and $n^3/6 + O(n^2)$
additions. For details we refer to [26, 27].
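The text does not spell out the Cholesky method, so the following is only a sketch of the standard column-by-column formulation of $A = LL^T$ and not necessarily the variant discussed in [26, 27]. The function name `cholesky` is an assumption, and the input is assumed to be real and symmetric.

```python
import math


def cholesky(a):
    """Return the lower triangular L with A = L L^T for a real symmetric
    positive definite matrix A given as a list of rows."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for j in range(n):
        # diagonal entry: what is left of a_jj after the earlier columns
        s = a[j][j] - sum(l[j][k] ** 2 for k in range(j))
        if s <= 0:
            raise ValueError("matrix is not positive definite")
        l[j][j] = math.sqrt(s)
        # entries below the diagonal in column j
        for i in range(j + 1, n):
            t = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = t / l[j][j]
    return l
```

Only the lower triangle of $A$ is read, which reflects the saving due to symmetry mentioned above.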


2.4 QR Decomposition 


We conclude this chapter by describing a second elimination method for
linear systems, which leads to a QR decomposition.

Definition 2.10 A factorization of a matrix $A$ into a product

$$A = QR$$

of a unitary matrix $Q$ and an upper (right) triangular matrix $R$ is called a
QR decomposition of $A$.


We recall that a matrix $Q$ is called unitary if

$$QQ^* = Q^*Q = I.$$

The product of two unitary matrices again is unitary.

In terms of the columns of the matrices $A = (a_1,\dots,a_n)$ and
$Q = (q_1,\dots,q_n)$ and the coefficients of $R = (r_{jk})$, the QR decomposition
$A = QR$ means that

$$a_k = \sum_{i=1}^{k} r_{ik}q_i, \quad k = 1,\dots,n. \qquad (2.7)$$

Hence, the vectors $a_1,\dots,a_n$ of $\mathbb{C}^n$ have to be orthonormalized from the
left to the right into an orthonormal basis $q_1,\dots,q_n$. This, for example, can
be achieved by the Gram–Schmidt orthonormalization procedure (see Theorem
3.18). However, since the Gram–Schmidt orthonormalization tends to
be numerically unstable, we describe the QR decomposition by Householder
matrices.


Definition 2.11 A matrix $H$ of the form

$$H = I - 2vv^*,$$

where $v$ is a column vector with $v^*v = 1$, i.e., a unit vector, is called a
Householder matrix.


Remark 2.12 Householder matrices are unitary and satisfy $H = H^*$.

Proof. We compute

$$H^* = I^* - 2(vv^*)^* = I - 2vv^* = H$$

and

$$HH^* = H^*H = (I - 2vv^*)(I - 2vv^*) = I - 4vv^* + 4vv^*vv^* = I,$$

where we use that $v^*v = 1$. □


Geometrically a Householder matrix corresponds to reflection across the
plane through the origin orthogonal to $v$. To see this we write

$$x = vv^*x + y$$

with the component $vv^*x$ of $x \in \mathbb{C}^n$ in the $v$-direction and a component $y$
orthogonal to $v$. Then we obtain

$$Hx = x - 2vv^*x = -vv^*x + y;$$

i.e., $Hx$ has the opposite component $-vv^*x$ in the $v$-direction and the
same component $y$ orthogonal to $v$. Because of this property, Householder
matrices are also called elementary reflection matrices.
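The reflection property is easy to try out numerically. In the sketch below the function name `householder_apply` is an assumption; the product $Hx = x - 2v(v^*x)$ is evaluated without ever forming the matrix $H$, which needs only one scalar product and one vector update.

```python
def householder_apply(v, x):
    """Return H x = x - 2 v (v^* x) for a unit vector v.

    v and x are lists of real or complex numbers of equal length.
    """
    vx = sum(vi.conjugate() * xi for vi, xi in zip(v, x))  # scalar v^* x
    return [xi - 2 * vx * vi for vi, xi in zip(v, x)]


v = [0.6, 0.8]                          # unit vector
x = [1.0, 2.0]
hx = householder_apply(v, x)            # reflected vector
back = householder_apply(v, hx)         # reflecting twice restores x
flipped = householder_apply(v, v)       # the v-direction changes sign
```

Up to rounding, `back` agrees with `x` because $H^2 = HH^* = I$, and `flipped` agrees with $-v$.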

We now describe the elimination of the unknown $x_1$ by multiplying $A$
from the left by a Householder matrix $H_1 = I - 2v_1v_1^*$. By $a_1$ we denote
the first column of $A$ and by $e_k$ the $k$th column of the identity matrix; in
particular, $e_1 = (1,0,\dots,0)^T$. Then the first column $b_1$ of the product $H_1A$
is given by

$$b_1 = H_1Ae_1 = H_1a_1 = a_1 - 2v_1v_1^*a_1.$$

We would like to achieve that $b_1 = \sigma e_1$ with $\sigma \neq 0$. Hence, except for the
first row, $v_1$ must be a multiple of $a_1$. Therefore, we try

$$u_1 = a_1 \mp \sigma e_1 \qquad (2.8)$$

with

$$\sigma = \begin{cases}
\dfrac{a_{11}}{|a_{11}|}\sqrt{a_1^*a_1}, & a_{11} \neq 0, \\[2mm]
\sqrt{a_1^*a_1}, & a_{11} = 0.
\end{cases}$$

Then we have

$$u_1^*u_1 = 2\left(a_1^*a_1 \mp |a_{11}|\sqrt{a_1^*a_1}\right)$$

and

$$u_1^*a_1 = a_1^*a_1 \mp |a_{11}|\sqrt{a_1^*a_1} = \frac{1}{2}\,u_1^*u_1.$$

Without loss of generality we may assume that $\sqrt{a_1^*a_1} - |a_{11}| > 0$, since
otherwise we would have that $a_1 = a_{11}e_1$, i.e., that the first column already
has the required form. Therefore, if we finally choose

$$v_1 = \frac{u_1}{\sqrt{u_1^*u_1}},$$

then $v_1$ is a unit vector, and as requested we have

$$H_1a_1 = a_1 - \frac{2u_1u_1^*a_1}{u_1^*u_1} = a_1 - u_1 = \pm\sigma e_1.$$

The remaining columns $b_k = H_1Ae_k$ are obtained from the columns $a_k$ of
$A$ by

$$b_k = H_1Ae_k = H_1a_k = a_k - 2v_1v_1^*a_k = a_k - \frac{2u_1^*a_k}{u_1^*u_1}\,u_1, \quad k = 2,\dots,n.$$

From the two possible signs in (2.8) the positive sign yields the numerically
more stable variant.

The same procedure is now repeated for the remaining $(n-1) \times (n-1)$
matrix. The corresponding $(n-1) \times (n-1)$ Householder matrix has to be
completed as an $n \times n$ Householder matrix. In general, if $A_k$ is an $n \times n$
matrix of the form

$$A_k = \begin{pmatrix} R_k & * \\ 0 & A_{n-k} \end{pmatrix}$$

with a $k \times k$ upper triangular matrix $R_k$ and an $(n-k) \times (n-k)$ matrix
$A_{n-k}$, we apply the Householder transformation described above with the
first column of $A_{n-k}$. With the corresponding $(n-k) \times (n-k)$ Householder
matrix $\tilde H_{n-k}$ the $n \times n$ matrix

$$H_{k+1} = \begin{pmatrix} I_k & 0 \\ 0 & \tilde H_{n-k} \end{pmatrix},$$

where $I_k$ denotes the $k \times k$ identity matrix,
yields an $n \times n$ Householder matrix $H_{k+1}$ that leaves the first $k$ columns
in triangular form and, in addition, transforms the $(k+1)$st column into
triangular form. In this way, after at most $n-1$ steps, we arrive at

$$H_{n-1} \cdots H_1 A = R$$

with Householder matrices $H_1,\dots,H_{n-1}$ and an upper triangular matrix
$R$. From this we obtain

$$A = QR$$

with the unitary matrix

$$Q = (H_{n-1} \cdots H_1)^* = H_1 \cdots H_{n-1}.$$

We summarize our result in the following theorem.


Theorem 2.13 For each $n \times n$ matrix a QR decomposition can be obtained
through $n-1$ Householder transformations.
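The Householder construction can be sketched in Python for real matrices. Several choices are assumptions made here: the name `householder_qr`, the restriction to real entries, the use of the positive sign in (2.8), and the decision to skip a step when the remaining column is already zero. Each reflection is applied through the vector $u$ rather than by forming the matrix $H_{k+1}$.

```python
import math


def householder_qr(a):
    """Return (Q, R) with A = Q R for a real square matrix A (list of rows)."""
    n = len(a)
    r = [[float(v) for v in row] for row in a]
    q = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for k in range(n - 1):
        col = [r[i][k] for i in range(k, n)]    # first column of remaining block
        norm = math.sqrt(sum(c * c for c in col))
        if norm == 0.0:
            continue                            # nothing to eliminate
        sigma = norm if col[0] >= 0 else -norm  # sigma carries the sign of the pivot
        u = col[:]
        u[0] += sigma                           # u = a_1 + sigma e_1
        uu = sum(c * c for c in u)
        # R := H R, acting on rows k..n-1 only
        for j in range(k, n):
            s = 2.0 * sum(u[i] * r[k + i][j] for i in range(n - k)) / uu
            for i in range(n - k):
                r[k + i][j] -= s * u[i]
        # Q := Q H, so that finally Q = H_1 ... H_{n-1}
        for i in range(n):
            s = 2.0 * sum(q[i][k + t] * u[t] for t in range(n - k)) / uu
            for t in range(n - k):
                q[i][k + t] -= s * u[t]
    return q, r
```

Because the positive sign in (2.8) is used, the diagonal entry produced in each step is $-\sigma$, so the diagonal of $R$ may contain negative numbers.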


The elimination by QR decomposition via Householder matrices can be
considered as an alternative to Gaussian elimination, since it does not need
pivoting. However, the operation count shows that $2n^3/3 + O(n^2)$ multiplications
are required (see Problem 2.18), i.e., twice the cost of Gaussian
elimination, and the added expense of partial pivoting in Gaussian elimination
does not close this gap. Hence, QR decomposition is rarely used
for the solution of linear systems. But later in this book we will see that
QR decomposition is an essential part of one of the best algorithms for
numerically computing the eigenvalues of a matrix (see Section 7.4).

Problems


2.1 Solve the linear system 


27, + 472 + 23 = 4 


241 + 642 — 23 = 10 


[I 
i) 


£1 + 522 + 273 
by Gaussian elimination. 


2.2 Write a computer program for the solution of a system of linear equations 
by Gaussian elimination with partial pivoting and test it for various examples. 
You will need this code as part of other numerical algorithms later in this book. 


2.3 Describe pivoting in Gaussian elimination by using permutation matrices. 


2.4 Let A and B be two n x n matrices. Show that if AB is nonsingular, then 
A and B are nonsingular. 


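2.5 Let $A$, $B$, $C$, and $D$ be $n \times n$ matrices and let $A$ be nonsingular. Show that

$$\det\begin{pmatrix} A & B \\ C & D \end{pmatrix} = \det A \,\det(D - CA^{-1}B).$$


2.6 Verify the summation formulas

$$\sum_{k=1}^{n} k = \frac{1}{2}\,n(n+1) \quad\text{and}\quad \sum_{k=1}^{n} k^2 = \frac{1}{6}\,n(n+1)(2n+1)$$

that were used in the proof of Theorem 2.7.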


2.7 Prove the analogue of Theorem 2.7 for the number of additions in Gaussian 
elimination. 


2.8 Show that tridiagonal matrices

$$\begin{pmatrix}
a_1 & c_1 & & & \\
b_2 & a_2 & c_2 & & \\
 & b_3 & a_3 & c_3 & \\
 & & \ddots & \ddots & \ddots \\
 & & & b_n & a_n
\end{pmatrix}$$

with the properties

$$|a_j| > |b_j| + |c_j|, \quad b_jc_j \neq 0, \quad j = 2,\dots,n-1,$$

and $|a_1| > |c_1| > 0$ and $|a_n| > |b_n| > 0$ are nonsingular.


2.9 Show that Gaussian elimination for tridiagonal $n \times n$ matrices requires $4n$
multiplications.

2.10 Show that the diagonal elements of a positive definite matrix are positive.


2.11 Prove that if $A = LL^T$ where $L$ is a real lower triangular nonsingular $n \times n$
matrix, then $A$ is symmetric and positive definite.


2.12 Show that

$$A = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 4 \\ 3 & 4 & 4 \end{pmatrix}$$

is not positive definite.


2.13 Show that for a symmetric positive definite matrix the symmetry and positive
definiteness are preserved in Gaussian elimination if diagonal elements are
taken as pivot elements, i.e., the submatrices $(a^{(m)}_{jk})$, $j,k = m,\dots,n$, are symmetric
and positive definite.


2.14 Show that the product of two lower (upper) triangular matrices again is 
lower (upper) triangular, that lower (upper) triangular matrices with nonvanish- 
ing diagonal elements are nonsingular, and that the inverse matrix of a lower 
(upper) triangular matrix again is lower (upper) triangular. 


2.15 Let $A$ be a nonsingular matrix and suppose $A = L_1R_1 = L_2R_2$, where $L_1$
and $L_2$ are lower triangular matrices with diagonal elements equal to one and $R_1$
and $R_2$ are upper triangular matrices. Show that $L_1 = L_2$ and $R_1 = R_2$.


2.16 Show that for each nonsingular n x n matrix A there exists a permutation 
matrix P such that PA has an LR decomposition. 


2.17 Solve the linear system 
21 + 622 — 2473 = 5 


27, + £2 — 243 = 1 


221 + 2x2 + 623 = 10 
by QR decomposition. 


2.18 Show that the solution of an $n \times n$ linear system by QR elimination with
Householder matrices requires $2n^3/3 + O(n^2)$ multiplications.


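2.19 Let $A$ be a complex $n \times n$ matrix and $y \in \mathbb{C}^n$ and assume that $A$, $\operatorname{Re} A$,
and $\operatorname{Im} A$ are nonsingular. Show that the $n \times n$ complex linear system $Ax = y$ is
equivalent to the two $n \times n$ real systems

$$\{(\operatorname{Im} A)^{-1}\operatorname{Re} A + (\operatorname{Re} A)^{-1}\operatorname{Im} A\}\,\operatorname{Re} x = (\operatorname{Im} A)^{-1}\operatorname{Re} y + (\operatorname{Re} A)^{-1}\operatorname{Im} y,$$

$$\{(\operatorname{Im} A)^{-1}\operatorname{Re} A + (\operatorname{Re} A)^{-1}\operatorname{Im} A\}\,\operatorname{Im} x = (\operatorname{Im} A)^{-1}\operatorname{Im} y - (\operatorname{Re} A)^{-1}\operatorname{Re} y.$$


2.20 Use QR decomposition to prove Hadamard’s inequality

$$|\det A|^2 \le \prod_{j=1}^{n} \sum_{k=1}^{n} |a_{jk}|^2$$

for the determinant of an $n \times n$ matrix $A = (a_{jk})$.

Basic Functional Analysis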


In the subsequent chapters we want to discuss iterative methods for the 
solution of systems of linear and nonlinear equations. For this we will need 
some fundamental concepts of functional analysis, which we will start to 
develop now. We shall use these functional-analytic tools also in later parts 
of this book in some of our convergence and error analysis for the approx- 
imate solution of differential and integral equations. 

We begin by introducing the notions of normed spaces and their elementary
properties, where we assume that the reader is familiar with the
concept of linear spaces or vector spaces and their basic properties. Then
we proceed by considering scalar product spaces as special cases of normed
spaces.

We will continue with the discussion of linear and continuous operators 
acting between normed spaces. Particular attention is given to linear oper- 
ators between finite-dimensional spaces, i.e., to matrices and their various 
norms. The main part of this chapter is Banach’s fixed point theorem, also 
known as the contraction mapping principle, which is one of the most im- 
portant tools in numerical analysis and is the fundamental basis of our 
investigations of iterative methods for linear and nonlinear systems. At the 
end of the chapter we will introduce some of the basic concepts of approx- 
imation theory, which will be useful later in other parts of this book. 

For a broader and more detailed study we refer to [5, 34, 35, 39, 59] or
any other introductory book on functional analysis.

3.1 Normed Spaces


Definition 3.1 Let $X$ be a complex (or real) linear space (vector space).
A function $\|\cdot\| : X \to \mathbb{R}$ with the properties

(N1) $\|x\| \ge 0$, (positivity)
(N2) $\|x\| = 0$ if and only if $x = 0$, (definiteness)
(N3) $\|\alpha x\| = |\alpha|\,\|x\|$, (homogeneity)
(N4) $\|x + y\| \le \|x\| + \|y\|$, (triangle inequality)

for all $x, y \in X$ and all $\alpha \in \mathbb{C}$ (or $\mathbb{R}$) is called a norm on $X$. A linear
space $X$ equipped with a norm is called a normed space. For $X = \mathbb{R}^n$ or
$X = \mathbb{C}^n$ we will also call the norm a vector norm.


Example 3.2 Some examples of norms on $\mathbb{R}^n$ and $\mathbb{C}^n$ are given by

$$\|x\|_1 := \sum_{j=1}^{n} |x_j|, \qquad \|x\|_2 := \Bigl(\sum_{j=1}^{n} |x_j|^2\Bigr)^{1/2}, \qquad \|x\|_\infty := \max_{j=1,\dots,n} |x_j|$$

for $x = (x_1,\dots,x_n)^T$. It is an easy exercise for the reader to verify that
the norm axioms (N1)–(N3) are satisfied. The triangle inequality for the
norms $\|\cdot\|_1$ and $\|\cdot\|_\infty$ follows immediately from the triangle inequality in
$\mathbb{R}$ or $\mathbb{C}$. The verification of the triangle inequality for the norm $\|\cdot\|_2$ is
postponed until Section 3.2.


The norms in Example 3.2 are denoted the $\ell_1$, $\ell_2$, and $\ell_\infty$ norm, respectively.
For obvious reasons the $\ell_2$ norm is also called the Euclidean norm,
and the $\ell_\infty$ norm is called the maximum norm. The three norms are special
cases of the $\ell_p$ norm

$$\|x\|_p := \Bigl(\sum_{j=1}^{n} |x_j|^p\Bigr)^{1/p}, \qquad (3.1)$$

defined for any real number $p \ge 1$. The $\ell_\infty$ norm is the limiting case of
(3.1) as $p \to \infty$ (see Problem 3.1).
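The limiting behaviour of (3.1) can be observed numerically. The helper names `p_norm` and `max_norm` in this small sketch are assumptions made here for illustration.

```python
def p_norm(x, p):
    """The l_p norm (3.1) of a list of numbers, for real p >= 1."""
    return sum(abs(v) ** p for v in x) ** (1.0 / p)


def max_norm(x):
    """The maximum norm of a list of numbers."""
    return max(abs(v) for v in x)


x = [3, -4, 1]
values = [p_norm(x, p) for p in (1, 2, 10, 50)]   # decreases toward max_norm(x)
```

For this vector the values start at $8$ for $p = 1$ and approach the maximum norm $4$ from above as $p$ grows.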

Remark 3.3 For each norm, the second triangle inequality

$$\bigl|\,\|x\| - \|y\|\,\bigr| \le \|x - y\|$$

holds for all $x, y \in X$.


Proof. From the triangle inequality we have

$$\|x\| = \|x - y + y\| \le \|x - y\| + \|y\|,$$

whence $\|x\| - \|y\| \le \|x - y\|$ follows. Analogously, by interchanging the
roles of $x$ and $y$ we have $\|y\| - \|x\| \le \|y - x\|$. □


For two elements $x, y$ in a normed space $\|x - y\|$ is called the distance
between $x$ and $y$.


Definition 3.4 A sequence $(x_n)$ of elements in a normed space $X$ is called
convergent if there exists an element $x \in X$ such that

$$\lim_{n\to\infty} \|x_n - x\| = 0,$$

i.e., if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$ such that $\|x_n - x\| < \varepsilon$
for all $n \ge N(\varepsilon)$. The element $x$ is called the limit of the sequence $(x_n)$,
and we write

$$\lim_{n\to\infty} x_n = x$$

or

$$x_n \to x, \quad n \to \infty.$$

A sequence that does not converge is called divergent.

Theorem 3.5 The limit of a convergent sequence is uniquely determined.


Proof. Assume that $x_n \to x$ and $x_n \to y$ for $n \to \infty$. Then from the triangle
inequality we obtain that

$$\|x - y\| = \|x - x_n + x_n - y\| \le \|x - x_n\| + \|x_n - y\| \to 0, \quad n \to \infty.$$

Therefore, $\|x - y\| = 0$ and $x = y$ by (N2). □


Definition 3.6 Two norms on a linear space are called equivalent if they 
have the same convergent sequences. 


Theorem 3.7 Two norms $\|\cdot\|_a$ and $\|\cdot\|_b$ on a linear space $X$ are equivalent
if and only if there exist positive numbers $c$ and $C$ such that

$$c\|x\|_a \le \|x\|_b \le C\|x\|_a$$

for all $x \in X$. The limits with respect to the two norms coincide.


Proof. Provided that the conditions are satisfied, from $\|x_n - x\|_a \to 0$,
$n \to \infty$, it follows that $\|x_n - x\|_b \to 0$, $n \to \infty$, and vice versa.
Conversely, let the two norms be equivalent and assume that there is
no $C > 0$ such that $\|x\|_b \le C\|x\|_a$ for all $x \in X$. Then there exists a
sequence $(x_n)$ with $\|x_n\|_a = 1$ and $\|x_n\|_b \ge n^2$. Now, the sequence $(y_n)$
with $y_n := x_n/n$ converges to zero with respect to $\|\cdot\|_a$, whereas with
respect to $\|\cdot\|_b$ it is divergent because of $\|y_n\|_b \ge n$. This contradicts the
equivalence of the two norms. The existence of $c$ follows in the same way
by interchanging the roles of the two norms. □


Theorem 3.8 On a finite-dimensional linear space all norms are equivalent.

Proof. In a linear space $X$ with finite dimension $n$ and basis $u_1,\dots,u_n$
every element can be expressed in the form

$$x = \sum_{j=1}^{n} \alpha_ju_j.$$

As in Example 3.2,

$$\|x\|_\infty := \max_{j=1,\dots,n} |\alpha_j| \qquad (3.2)$$

defines a norm on $X$. Let $\|\cdot\|$ denote any other norm on $X$. Then, by the
triangle inequality we have

$$\|x\| \le \sum_{j=1}^{n} |\alpha_j|\,\|u_j\| \le C\|x\|_\infty$$

for all $x \in X$, where

$$C := \sum_{j=1}^{n} \|u_j\|.$$

Assume that there is no $c > 0$ such that $c\|x\|_\infty \le \|x\|$ for all $x \in X$.
Then there exists a sequence $(x_\nu)$ with $\|x_\nu\| = 1$ such that $\|x_\nu\|_\infty \ge \nu$.
Consider the sequence $(y_\nu)$ with $y_\nu := x_\nu/\|x_\nu\|_\infty$ and write

$$y_\nu = \sum_{j=1}^{n} \alpha_{j\nu}u_j.$$

Because of $\|y_\nu\|_\infty = 1$ each of the sequences $(\alpha_{j\nu})$, $j = 1,\dots,n$, is bounded
in $\mathbb{C}$. Hence, by the Bolzano–Weierstrass theorem we can select convergent
subsequences $\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1,\dots,n$. This now implies
$\|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$, where

$$y := \sum_{j=1}^{n} \alpha_ju_j,$$

and also $\|y_{\nu(\ell)} - y\| \le C\|y_{\nu(\ell)} - y\|_\infty \to 0$, $\ell \to \infty$. But on the other hand
we have $\|y_\nu\| = 1/\|x_\nu\|_\infty \to 0$, $\nu \to \infty$. Therefore, $y = 0$, and consequently
$\|y_{\nu(\ell)}\|_\infty \to 0$, $\ell \to \infty$, which contradicts $\|y_\nu\|_\infty = 1$ for all $\nu$. □


The following definitions carry over some useful concepts from Euclidean 
space to general normed spaces. 


Definition 3.9 A subset $U$ of a normed space $X$ is called closed if it contains
all limits of convergent sequences of $U$. The closure $\overline{U}$ of a subset $U$
of a normed space $X$ is the set of all limits of convergent sequences of $U$. A
subset $U$ is called open if its complement $X \setminus U$ is closed. A set $U$ is called
dense in another set $V$ if $V \subset \overline{U}$, i.e., if each element in $V$ is the limit of
a convergent sequence from $U$.

Obviously, a subset $U$ is closed if and only if it coincides with its closure.
For $x_0$ in $X$ and $r > 0$ the set $B[x_0,r] := \{x \in X : \|x - x_0\| \le r\}$ is closed
and is called the closed ball of radius $r$ and center $x_0$. Correspondingly, the
set $B(x_0,r) := \{x \in X : \|x - x_0\| < r\}$ is open and is called an open ball.


Definition 3.10 A subset $U$ of a normed space $X$ is called bounded if
there exists a positive number $C$ such that $\|x\| \le C$ for all $x \in U$.


Convergent sequences are bounded (see Problem 3.6). 


Theorem 3.11 Any bounded sequence in a finite-dimensional normed space
$X$ contains a convergent subsequence.


Proof. Let $u_1,\dots,u_n$ be a basis of $X$ and let $(x_\nu)$ be a bounded sequence.
Then writing

$$x_\nu = \sum_{j=1}^{n} \alpha_{j\nu}u_j$$

and using the norm (3.2), as in the proof of Theorem 3.8 we deduce that
each of the sequences $(\alpha_{j\nu})$, $j = 1,\dots,n$, is bounded in $\mathbb{C}$. Hence, by
the Bolzano–Weierstrass theorem we can select convergent subsequences
$\alpha_{j,\nu(\ell)} \to \alpha_j$, $\ell \to \infty$, for each $j = 1,\dots,n$. This now implies

$$x_{\nu(\ell)} \to \sum_{j=1}^{n} \alpha_ju_j \in X, \quad \ell \to \infty,$$

and the proof is finished. □


3.2 Scalar Products 


Definition 3.12 Let $X$ be a complex (or real) linear space. Then a function
$(\cdot\,,\cdot) : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) with the properties

(H1) $(x,x) \ge 0$, (positivity)
(H2) $(x,x) = 0$ if and only if $x = 0$, (definiteness)
(H3) $(x,y) = \overline{(y,x)}$, (symmetry)
(H4) $(\alpha x + \beta y, z) = \alpha(x,z) + \beta(y,z)$, (linearity)

for all $x, y, z \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$) is called a scalar product, or an
inner product, on $X$. (By the bar we denote the complex conjugate.) A
linear space $X$ equipped with a scalar product is called a pre-Hilbert space.

As a simple consequence of (H3) and (H4) we note the antilinearity

(H4') $(x, \alpha y + \beta z) = \bar\alpha(x,y) + \bar\beta(x,z)$.


Example 3.13 An example of a scalar product on $\mathbb{R}^n$ and $\mathbb{C}^n$ is given by

$$(x,y) := \sum_{j=1}^{n} x_j\overline{y_j}$$

for $x = (x_1,\dots,x_n)^T$ and $y = (y_1,\dots,y_n)^T$. (Note that $(x,y) = y^*x$.)


Theorem 3.14 For a scalar product we have the Cauchy–Schwarz inequality

$$|(x,y)|^2 \le (x,x)(y,y)$$

for all $x, y \in X$, with equality if and only if $x$ and $y$ are linearly dependent.


Proof. The inequality is trivial for $x = 0$. For $x \neq 0$ it follows from

$$\begin{aligned}
(\alpha x + \beta y, \alpha x + \beta y) &= |\alpha|^2(x,x) + 2\operatorname{Re}\{\alpha\bar\beta(x,y)\} + |\beta|^2(y,y) \\
&= (x,x)(y,y) - |(x,y)|^2,
\end{aligned}$$

where we have set $\alpha = -(x,x)^{-1/2}\,\overline{(x,y)}$ and $\beta = (x,x)^{1/2}$. Since $(\cdot\,,\cdot)$ is
positive definite, this expression is nonnegative, and it is equal to zero if
and only if $\alpha x + \beta y = 0$. In the latter case $x$ and $y$ are linearly dependent
because $\beta \neq 0$. □

Theorem 3.15 A scalar product $(\cdot\,,\cdot)$ on a linear space $X$ defines a norm
by

$$\|x\| := (x,x)^{1/2}$$

for all $x \in X$; i.e., a pre-Hilbert space is always a normed space.


Proof. We leave it as an exercise for the reader to verify the norm axioms.
The triangle inequality follows by

$$\|x + y\|^2 = (x + y, x + y) \le \|x\|^2 + 2\|x\|\,\|y\| + \|y\|^2 = (\|x\| + \|y\|)^2$$

from the Cauchy–Schwarz inequality. □


Note that we can rewrite the Cauchy–Schwarz inequality in the form

$$|(x,y)| \le \|x\|\,\|y\|.$$

The scalar product of Example 3.13 generates the Euclidean norm of Example
3.2, and therefore it is called the Euclidean scalar product. Theorem
3.15 includes the triangle inequality for the Euclidean norm that we postponed
in Example 3.2.

The following definition generalizes the concept of orthogonality from
Euclidean space to pre-Hilbert spaces.

Definition 3.16 Two elements $x$ and $y$ of a pre-Hilbert space $X$ are called
orthogonal if

$$(x,y) = 0.$$

Two subsets $U$ and $V$ of $X$ are called orthogonal if each pair of elements
$x \in U$ and $y \in V$ are orthogonal. For two orthogonal elements or subsets
we write $x \perp y$ and $U \perp V$, respectively. A subset $U$ of $X$ is called an
orthogonal system if $(x,y) = 0$ for all $x, y \in U$ with $x \neq y$. An orthogonal
system $U$ is called an orthonormal system if $\|x\| = 1$ for all $x \in U$.


Theorem 3.17 The elements of an orthonormal system are linearly independent.


Proof. From

$$\sum_{k=1}^{n} \alpha_kq_k = 0$$

for the orthonormal system $\{q_1,\dots,q_n\}$, by taking the scalar product with
$q_j$, we immediately have that $\alpha_j = 0$ for $j = 1,\dots,n$. □


The Gram-—Schmidt orthogonalization procedure as described in the fol- 
lowing theorem provides a converse of Theorem 3.17. For a subset U of 
a linear space X we denote the set spanned by all linear combinations of 
elements of U by span{U}. 


Theorem 3.18 Let {uo,ui,...} be a finite or countable number of linearly 
independent elements of a pre-Hilbert space. Then there exists a uniquely 
determined orthogonal system {qo,q1,---.} of the form 


dn =Unt+Tn, n=O0,1,..., (3.3) 
with ro = 0 andr, € span{uo,...,Un-1}, N= 1,2,..., satisfying 
span{uo,...,Un} = span{gqo,---,dn}, nm=O0,1,.... (3.4) 


Proof. Assume that we have constructed orthogonal elements of the form
(3.3) with the property (3.4) up to $q_{n-1}$. By (3.4), the $\{q_0, \dots, q_{n-1}\}$ are
linearly independent, and therefore $\|q_k\| \ne 0$ for $k = 0, 1, \dots, n-1$. Hence,
\[ q_n := u_n - \sum_{k=0}^{n-1} \frac{(u_n, q_k)}{(q_k, q_k)} \, q_k \]
is well-defined, and using the induction assumption, we obtain $(q_n, q_m) = 0$
for $m = 0, \dots, n-1$ and
\[ \operatorname{span}\{u_0, \dots, u_{n-1}, u_n\} = \operatorname{span}\{q_0, \dots, q_{n-1}, u_n\} = \operatorname{span}\{q_0, \dots, q_{n-1}, q_n\}. \]
Hence, the existence of $q_n$ is established.

Assume that $\{q_0, q_1, \dots\}$ and $\{\tilde q_0, \tilde q_1, \dots\}$ are two orthogonal sets of
elements with the required properties. Then clearly $q_0 = u_0 = \tilde q_0$. Assume
that we have shown that equality holds up to $q_{n-1} = \tilde q_{n-1}$. Then, since
$q_n - \tilde q_n \in \operatorname{span}\{u_0, \dots, u_{n-1}\}$, we can represent $q_n - \tilde q_n$ as a linear
combination of $q_0, \dots, q_{n-1}$; i.e.,
\[ q_n - \tilde q_n = \sum_{k=0}^{n-1} \alpha_k q_k. \]
Now the orthogonality yields
\[ \|q_n - \tilde q_n\|^2 = \Bigl( q_n - \tilde q_n, \sum_{k=0}^{n-1} \alpha_k q_k \Bigr) = 0, \]
whence $q_n = \tilde q_n$. □
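The recursion used in the existence part of the proof can be tried out numerically. The following is a minimal sketch, assuming real vectors stored as Python lists and the Euclidean scalar product; the names `dot` and `gram_schmidt` are illustrative and not part of the text.

```python
def dot(x, y):
    """Euclidean scalar product of two real vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def gram_schmidt(us):
    """Return orthogonal q_0, q_1, ... with
    q_n = u_n - sum_k (u_n, q_k) / (q_k, q_k) * q_k  (no normalization)."""
    qs = []
    for u in us:
        q = list(u)
        for p in qs:
            c = dot(u, p) / dot(p, p)  # well-defined: p != 0 by linear independence
            q = [qi - c * pi for qi, pi in zip(q, p)]
        qs.append(q)
    return qs


# Three linearly independent vectors in R^3 (an arbitrary test case).
us = [[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
qs = gram_schmidt(us)
for i in range(3):
    for j in range(i):
        print(i, j, dot(qs[i], qs[j]))  # each scalar product is zero up to rounding
```

Dividing each $q_n$ by $\|q_n\|$ afterwards would turn the orthogonal system into an orthonormal one. In floating-point arithmetic this classical form of the recursion can lose orthogonality for nearly dependent inputs, so the sketch is meant as an illustration rather than a robust routine.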


3.3 Bounded Linear Operators 


By the symbol $A : X \to Y$ we will denote a mapping whose domain of
definition is a set $X$ and whose range is contained in a set $Y$; i.e., for every
$x \in X$ the mapping $A$ assigns a unique element $Ax \in Y$. The range is the
set $A(X) := \{Ax : x \in X\}$ of all image elements. We will use the terms
mapping, function, and operator synonymously. (We have already used this
convention in Definitions 3.1 and 3.12.)


Definition 3.19 An operator $A$ mapping a subset $U$ of a normed space $X$
into a normed space $Y$ is called continuous at $x \in U$ if for every sequence
$(x_n)$ from $U$ with $\lim_{n\to\infty} x_n = x$ we have $\lim_{n\to\infty} Ax_n = Ax$. The function
$A : U \to Y$ is called continuous if it is continuous for all $x \in U$.


An equivalent definition is the following: A function $A : U \subset X \to Y$
is continuous at $x \in U$ if for every $\varepsilon > 0$ there exists $\delta > 0$ such that
$\|Ax - Ay\| < \varepsilon$ for all $y \in U$ with $\|x - y\| < \delta$. Here we have used the same
symbol $\|\cdot\|$ for the norms on $X$ and $Y$. Note that by the second triangle
inequality of Remark 3.3 the norm is a continuous function.


Definition 3.20 An operator $A : X \to Y$ mapping a linear space $X$ into
a linear space $Y$ is called linear if
\[ A(\alpha x + \beta y) = \alpha Ax + \beta Ay \]
for all $x, y \in X$ and all $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$).


Theorem 3.21 A linear operator is continuous if it is continuous at one
element.

Proof. Let $A : X \to Y$ be continuous at $x_0 \in X$. Then for every $x \in X$ and
every sequence $(x_n)$ with $x_n \to x$, $n \to \infty$, we have
\[ Ax_n = A(x_n - x + x_0) + A(x - x_0) \to A(x_0) + A(x - x_0) = A(x), \quad n \to \infty, \]
since $x_n - x + x_0 \to x_0$, $n \to \infty$. □


Definition 3.22 A linear operator $A : X \to Y$ from a normed space $X$
into a normed space $Y$ is called bounded if there exists a positive number
$C$ such that
\[ \|Ax\| \le C \|x\| \]
for all $x \in X$. Each number $C$ for which this inequality holds is called a
bound for the operator $A$. (Again we have used the same symbol $\|\cdot\|$ for
the norms on $X$ and $Y$.)


Theorem 3.23 A linear operator $A : X \to Y$ is bounded if and only if
\[ \|A\| := \sup_{\|x\| = 1} \|Ax\| < \infty. \]
The number $\|A\|$ is the smallest bound for $A$ and is called the norm of $A$.

Proof. Assume that $A$ is bounded with the bound $C$. Then
\[ \sup_{\|x\| = 1} \|Ax\| \le C, \]
and, in particular, $\|A\|$ is less than or equal to any bound for $A$. Conversely,
if $\|A\| < \infty$, then using the linearity of $A$ and the homogeneity of the norm,
we find that
\[ \|Ax\| = \Bigl\| A\Bigl( \frac{x}{\|x\|} \Bigr) \Bigr\| \, \|x\| \le \|A\| \, \|x\| \]
for all $x \ne 0$. Therefore, $A$ is bounded with the bound $\|A\|$. □


Theorem 3.24 A linear operator is continuous if and only if it is bounded. 


Proof. Let $A : X \to Y$ be bounded and let $(x_n)$ be a sequence in $X$ with
$x_n \to 0$, $n \to \infty$. Then from $\|Ax_n\| \le C \|x_n\|$ it follows that $Ax_n \to 0$,
$n \to \infty$. Thus, $A$ is continuous at $x = 0$, and because of Theorem 3.21 it is
continuous everywhere in $X$.

Conversely, let $A$ be continuous and assume that there is no $C > 0$ such
that $\|Ax\| \le C \|x\|$ for all $x \in X$. Then there exists a sequence $(x_n)$ in $X$
with $\|x_n\| = 1$ and $\|Ax_n\| \ge n$. Consider the sequence $y_n := x_n / \|Ax_n\|$.
Then $y_n \to 0$, $n \to \infty$, and since $A$ is continuous, $Ay_n \to A(0) = 0$, $n \to \infty$.
This is a contradiction to $\|Ay_n\| = 1$ for all $n$. Hence, $A$ is bounded. □


Remark 3.25 Let $X$, $Y$, and $Z$ be normed spaces and let $A : X \to Y$ and
$B : Y \to Z$ be bounded linear operators. Then the product $BA : X \to Z$,
defined by $(BA)x := B(Ax)$ for all $x \in X$, is a bounded linear operator
with $\|BA\| \le \|A\| \, \|B\|$.


Proof. This follows from $\|(BA)x\| = \|B(Ax)\| \le \|B\| \, \|A\| \, \|x\|$. □


Theorem 3.26 Let $(a_{jk})$ be a real or complex $n \times n$ matrix. Then the
linear operators $A : \mathbb{R}^n \to \mathbb{R}^n$ and $A : \mathbb{C}^n \to \mathbb{C}^n$, defined by
\[ (Ax)_j := \sum_{k=1}^{n} a_{jk} x_k, \quad j = 1, \dots, n, \]
are bounded with respect to each norm on $\mathbb{R}^n$ and $\mathbb{C}^n$. In particular, we
have
\[ \|A\|_1 = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|, \tag{3.5} \]
\[ \|A\|_\infty = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|, \tag{3.6} \]
\[ \|A\|_2 \le \Bigl( \sum_{j,k=1}^{n} |a_{jk}|^2 \Bigr)^{1/2}. \tag{3.7} \]
In this case the norms are also called matrix norms. (Note that in (3.5)-(3.7)
both the domain and the range are given the same norm.)


Proof. By Theorem 3.8 it suffices to prove boundedness of $A$ with respect
to one norm. For $\|\cdot\|_1$ we can estimate
\[ \|Ax\|_1 = \sum_{j=1}^{n} |(Ax)_j| = \sum_{j=1}^{n} \Bigl| \sum_{k=1}^{n} a_{jk} x_k \Bigr|
   \le \sum_{k=1}^{n} |x_k| \sum_{j=1}^{n} |a_{jk}|
   \le \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}| \; \sum_{k=1}^{n} |x_k|. \]
Therefore, we have that
\[ \|A\|_1 \le \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|. \tag{3.8} \]
Now choose $i$ such that
\[ \sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|, \]
and choose $z \in \mathbb{R}^n$ with $z_i = 1$ and $z_k = 0$ for $k \ne i$. Then $\|z\|_1 = 1$ and
\[ \|Az\|_1 = \sum_{j=1}^{n} |(Az)_j| = \sum_{j=1}^{n} \Bigl| \sum_{k=1}^{n} a_{jk} z_k \Bigr|
   = \sum_{j=1}^{n} |a_{ji}| = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|. \]
Hence
\[ \|A\|_1 = \sup_{\|x\|_1 = 1} \|Ax\|_1 \ge \|Az\|_1 = \max_{k=1,\dots,n} \sum_{j=1}^{n} |a_{jk}|, \tag{3.9} \]
and from (3.8) and (3.9) we obtain (3.5).
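For $\|\cdot\|_\infty$ we can estimate
\[ \|Ax\|_\infty = \max_{j=1,\dots,n} |(Ax)_j| = \max_{j=1,\dots,n} \Bigl| \sum_{k=1}^{n} a_{jk} x_k \Bigr|
   \le \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}| \, |x_k|
   \le \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}| \; \max_{k=1,\dots,n} |x_k|. \]
Therefore, we have that
\[ \|A\|_\infty \le \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|. \tag{3.10} \]
Now choose $i$ such that
\[ \sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|, \]
and choose $z \in \mathbb{C}^n$ with $z_k = \bar a_{ik} / |a_{ik}|$ if $a_{ik} \ne 0$ and $z_k = 1$ if $a_{ik} = 0$.
Then $\|z\|_\infty = 1$ and
\[ \|Az\|_\infty = \max_{j=1,\dots,n} |(Az)_j| = \max_{j=1,\dots,n} \Bigl| \sum_{k=1}^{n} a_{jk} z_k \Bigr|
   \ge \Bigl| \sum_{k=1}^{n} a_{ik} z_k \Bigr| = \sum_{k=1}^{n} |a_{ik}| = \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|. \]
Hence
\[ \|A\|_\infty = \sup_{\|x\|_\infty = 1} \|Ax\|_\infty \ge \|Az\|_\infty \ge \max_{j=1,\dots,n} \sum_{k=1}^{n} |a_{jk}|, \tag{3.11} \]
and from (3.10) and (3.11) we obtain (3.6).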
Finally, for $\|\cdot\|_2$, using the Cauchy-Schwarz inequality we can estimate
\[ \|Ax\|_2^2 = \sum_{j=1}^{n} |(Ax)_j|^2 = \sum_{j=1}^{n} \Bigl| \sum_{k=1}^{n} a_{jk} x_k \Bigr|^2
   \le \sum_{j=1}^{n} \Bigl( \sum_{k=1}^{n} |a_{jk}|^2 \sum_{k=1}^{n} |x_k|^2 \Bigr)
   = \sum_{j,k=1}^{n} |a_{jk}|^2 \; \sum_{k=1}^{n} |x_k|^2. \]
Therefore,
\[ \|A\|_2^2 \le \sum_{j,k=1}^{n} |a_{jk}|^2, \]
and (3.7) is proven. In this inequality equality does not hold, in general, as
can be seen by considering the identity matrix. □
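The formulas (3.5)-(3.7) translate directly into a few lines of code. The sketch below assumes a square matrix stored as a list of rows; the function names `norm_1`, `norm_inf`, and `frobenius_bound` are illustrative choices, not notation from the text.

```python
import math


def norm_1(a):
    """Maximum absolute column sum, formula (3.5)."""
    n = len(a)
    return max(sum(abs(a[j][k]) for j in range(n)) for k in range(n))


def norm_inf(a):
    """Maximum absolute row sum, formula (3.6)."""
    return max(sum(abs(v) for v in row) for row in a)


def frobenius_bound(a):
    """Right-hand side of (3.7): an upper bound for the 2-norm."""
    return math.sqrt(sum(abs(v) ** 2 for row in a for v in row))


A = [[1.0, -2.0], [3.0, 4.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(norm_1(A), norm_inf(A), frobenius_bound(A))
# For the identity the bound (3.7) gives sqrt(2), although ||I||_2 = 1.
print(frobenius_bound(identity))
```

The second print illustrates the closing remark of the proof: for the identity matrix the bound in (3.7) is strictly larger than the norm it estimates.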


In order to derive a representation for $\|A\|_2$ we need to recall the definition
and some basic facts about eigenvalues and eigenvectors of a matrix.
A number $\lambda \in \mathbb{C}$ is called an eigenvalue of the matrix $A$ if there exists a
vector $x \in \mathbb{C}^n$ with $x \ne 0$ such that
\[ Ax = \lambda x. \]
The vector $x$ is called an eigenvector for the eigenvalue $\lambda$. Each $n \times n$
matrix has at least one and at most $n$ eigenvalues, since the characteristic
polynomial $\det(A - \lambda I)$ has at least one and at most $n$ zeros. Eigenvectors
for different eigenvalues are linearly independent (see Problem 3.12). The
algebraic multiplicity of an eigenvalue of a matrix is its multiplicity as a zero
of the characteristic polynomial; its geometric multiplicity is the number of
linearly independent eigenvectors associated with the eigenvalue.


Theorem 3.27 To each matrix $A$ there exists a unitary matrix $Q$ such
that $Q^* A Q$ is an upper triangular matrix.


Proof. Assume that it has been shown that for each $(n-1) \times (n-1)$
matrix $A_{n-1}$ there exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that
$Q_{n-1}^* A_{n-1} Q_{n-1}$ is an upper triangular matrix. Let $\lambda$ be an eigenvalue of the
$n \times n$ matrix $A_n$ with eigenvector $u$. We may assume that $(u, u) = 1$, where
$(\cdot\,,\cdot)$ is the Euclidean scalar product. Using the Gram-Schmidt procedure
of Theorem 3.18 we can construct an orthonormal basis of $\mathbb{C}^n$ of the form
$u, v_2, \dots, v_n$. Then we define a unitary $n \times n$ matrix by
\[ U_n := (u, v_2, \dots, v_n). \]
With the aid of $(u, v_j) = 0$, $j = 2, \dots, n$, we see that
\[ U_n^* A_n U_n = U_n^* (\lambda u, A_n v_2, \dots, A_n v_n)
   = \begin{pmatrix} \lambda & * \\ 0 & A_{n-1} \end{pmatrix} \]
with some $(n-1) \times (n-1)$ matrix $A_{n-1}$. By the induction assumption there
exists a unitary $(n-1) \times (n-1)$ matrix $Q_{n-1}$ such that $Q_{n-1}^* A_{n-1} Q_{n-1}$
is upper triangular. Then
\[ Q_n := U_n \begin{pmatrix} 1 & 0 \\ 0 & Q_{n-1} \end{pmatrix} \]
defines a unitary $n \times n$ matrix, and $Q_n^* A_n Q_n$ is upper triangular. □


Lemma 3.28 For an $n \times n$ matrix $A$ and its adjoint $A^*$ we have that
\[ (Ax, y) = (x, A^* y) \]
for all $x, y \in \mathbb{C}^n$, where $(\cdot\,,\cdot)$ denotes the Euclidean scalar product.


Proof. Simple calculations yield
\[ (Ax, y) = \sum_{j=1}^{n} (Ax)_j \, \bar y_j = \sum_{j=1}^{n} \sum_{k=1}^{n} a_{jk} x_k \bar y_j
   = \sum_{k=1}^{n} x_k \, \overline{\sum_{j=1}^{n} \bar a_{jk} y_j}
   = \sum_{k=1}^{n} x_k \, \overline{(A^* y)_k} = (x, A^* y), \]
where we have used that $a^*_{kj} = \bar a_{jk}$. □


Theorem 3.29 The eigenvalues of a Hermitian $n \times n$ matrix are real, and
the eigenvectors form an orthogonal basis in $\mathbb{C}^n$.


Proof. If $A$ is Hermitian, i.e., if $A = A^*$, then the matrix $\tilde A := Q^* A Q$ from
Theorem 3.27 is also Hermitian, since
\[ \tilde A^* = (Q^* A Q)^* = Q^* A^* Q^{**} = Q^* A Q = \tilde A. \]
Therefore, in this case the upper triangular matrix $\tilde A$ must be diagonal;
i.e.,
\[ \tilde A = D := \operatorname{diag}(\lambda_1, \dots, \lambda_n). \]
Since from $Q^* A Q = D$ it follows that $AQ = QD$, we can conclude that
the columns of $Q = (u_1, \dots, u_n)$ satisfy $Au_j = \lambda_j u_j$, $j = 1, \dots, n$. Hence
the eigenvectors of a Hermitian matrix form an orthogonal basis in $\mathbb{C}^n$.
Because of
\[ \lambda_j = (Au_j, u_j) = (u_j, Au_j) = \overline{(Au_j, u_j)} = \bar\lambda_j, \]
the eigenvalues of Hermitian matrices are real. □


For a positive semidefinite matrix $A$, i.e., for a Hermitian matrix with
the property
\[ (Ax, x) \ge 0, \quad x \in \mathbb{C}^n, \]
all eigenvalues are real and nonnegative. Analogously, the eigenvalues of a
positive definite matrix $A$, i.e., of a Hermitian matrix with the property
\[ (Ax, x) > 0, \quad x \in \mathbb{C}^n, \ x \ne 0, \]
are positive.

Definition 3.30 The number
\[ \rho(A) := \max \{ |\lambda| : \lambda \text{ eigenvalue of } A \} \]
is called the spectral radius of $A$.

Theorem 3.31 For an $n \times n$ matrix $A$ we have
\[ \|A\|_2 = \sqrt{\rho(A^* A)}. \]
If $A$ is Hermitian, then
\[ \|A\|_2 = \rho(A). \]

Proof. From Lemma 3.28 we have that
\[ \|Ax\|_2^2 = (Ax, Ax) = (x, A^* A x) \]
for all $x \in \mathbb{C}^n$. Hence the Hermitian matrix $A^* A$ is positive semidefinite
and therefore has $n$ orthonormal eigenvectors
\[ A^* A u_j = \mu_j u_j, \quad j = 1, \dots, n, \]
with real nonnegative eigenvalues. We use the orthonormal basis of
eigenvectors and represent $x \in \mathbb{C}^n$ by
\[ x = \sum_{j=1}^{n} \alpha_j u_j \]
and have
\[ \|x\|_2^2 = (x, x) = \Bigl( \sum_{j=1}^{n} \alpha_j u_j, \sum_{k=1}^{n} \alpha_k u_k \Bigr) = \sum_{j=1}^{n} |\alpha_j|^2 \]
and
\[ \|Ax\|_2^2 = (Ax, Ax) = (x, A^* A x)
   = \Bigl( \sum_{j=1}^{n} \alpha_j u_j, \sum_{k=1}^{n} \mu_k \alpha_k u_k \Bigr) = \sum_{j=1}^{n} \mu_j |\alpha_j|^2. \]
From this we obtain that
\[ \|Ax\|_2^2 \le \rho(A^* A) \, \|x\|_2^2, \]
whence
\[ \|A\|_2^2 \le \rho(A^* A) \]
follows. On the other hand, if we choose $j$ such that $\mu_j = \rho(A^* A)$, then we
have that
\[ \|A\|_2^2 = \sup_{\|x\|_2 = 1} \|Ax\|_2^2 \ge \|Au_j\|_2^2 = (u_j, A^* A u_j) = \mu_j = \rho(A^* A). \]
This concludes the proof of $\|A\|_2 = \sqrt{\rho(A^* A)}$. If $A$ is Hermitian, then
$A^* A = A^2$, whence $\rho(A^* A) = \rho(A^2) = [\rho(A)]^2$ follows. □
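The identity $\|A\|_2 = \sqrt{\rho(A^* A)}$ suggests a simple way to estimate the 2-norm numerically. The sketch below swaps in the power iteration, which is not discussed in this section, to approximate the largest eigenvalue of $A^T A$ for a real matrix. It assumes that the starting vector has a component in the direction of the dominant eigenvector; the helper names are illustrative.

```python
import math


def matvec(a, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(aij * xj for aij, xj in zip(row, x)) for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def two_norm_estimate(a, steps=200):
    """Estimate sqrt(rho(A^T A)) for a real square matrix by power iteration."""
    at = transpose(a)
    n = len(a)
    x = [1.0 / math.sqrt(n)] * n          # normalized starting vector (assumption)
    mu = 0.0
    for _ in range(steps):
        y = matvec(at, matvec(a, x))      # y = A^T A x
        mu = math.sqrt(sum(v * v for v in y))   # tends to the largest eigenvalue
        if mu == 0.0:
            return 0.0
        x = [v / mu for v in y]
    return math.sqrt(mu)


print(two_norm_estimate([[3.0, 0.0], [0.0, 1.0]]))   # close to 3
print(two_norm_estimate([[2.0, 1.0], [1.0, 2.0]]))   # symmetric: close to rho(A) = 3
```

The second matrix is Hermitian with eigenvalues 1 and 3, so the estimate agrees with $\rho(A)$ as stated in Theorem 3.31.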


The following final theorem of this section is of basic importance for
establishing a necessary and sufficient condition for the convergence of
iterative methods for linear systems.


Theorem 3.32 For each norm on $\mathbb{C}^n$ and each $n \times n$ matrix $A$ we have
that
\[ \rho(A) \le \|A\|. \]
Conversely, to each matrix $A$ and each $\varepsilon > 0$ there exists a norm on $\mathbb{C}^n$
such that
\[ \|A\| \le \rho(A) + \varepsilon. \]


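Proof. Let $\lambda$ be an eigenvalue of $A$ with eigenvector $u$. We may assume that
$\|u\| = 1$. Then the first part of the theorem follows from
\[ \|A\| = \sup_{\|x\| = 1} \|Ax\| \ge \|Au\| = \|\lambda u\| = |\lambda|. \]
For the second part, by Theorem 3.27 there exists a unitary matrix $Q$
such that
\[ B = Q^* A Q = \begin{pmatrix}
   b_{11} & b_{12} & b_{13} & \cdots & b_{1n} \\
          & b_{22} & b_{23} & \cdots & b_{2n} \\
          &        & \ddots &        & \vdots \\
          &        &        &        & b_{nn}
   \end{pmatrix} \]
is upper triangular. Because of $\det(\lambda I - A) = \det(\lambda I - B)$, the eigenvalues
of $A$ are given by $\lambda_j = b_{jj}$, $j = 1, \dots, n$. We set
\[ b := \max_{1 \le j < k \le n} |b_{jk}|, \qquad \delta := \min\Bigl( 1, \frac{\varepsilon}{(n-1)\, b} \Bigr) \]
(with $\delta := 1$ if $b = 0$) and define the diagonal matrix
\[ D := \operatorname{diag}(1, \delta, \delta^2, \dots, \delta^{n-1}) \]
with the inverse
\[ D^{-1} = \operatorname{diag}(1, \delta^{-1}, \delta^{-2}, \dots, \delta^{-n+1}). \]
Then for $C := D^{-1} B D$ we have that
\[ C = \begin{pmatrix}
   b_{11} & \delta b_{12} & \delta^2 b_{13} & \cdots & \delta^{n-1} b_{1n} \\
          & b_{22} & \delta b_{23} & \cdots & \delta^{n-2} b_{2n} \\
          &        & b_{33} & \cdots & \delta^{n-3} b_{3n} \\
          &        &        & \ddots & \vdots \\
          &        &        &        & b_{nn}
   \end{pmatrix}. \]
Since $\delta \le 1$, by Theorem 3.26, we can estimate
\[ \|C\|_\infty \le \max_{j=1,\dots,n} |b_{jj}| + (n-1)\, \delta b \le \rho(A) + \varepsilon. \]
After setting $V := QD$ we define a norm on $\mathbb{C}^n$ by $\|x\| := \|V^{-1} x\|_\infty$. Using
$C = V^{-1} A V$ we now obtain
\[ \|Ax\| = \|V^{-1} A x\|_\infty = \|C V^{-1} x\|_\infty \le \|C\|_\infty \|V^{-1} x\|_\infty = \|C\|_\infty \|x\| \]
for all $x \in \mathbb{C}^n$. Hence
\[ \|A\| \le \|C\|_\infty \le \rho(A) + \varepsilon, \]
and the proof is finished. □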
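The scaling step in the second part of the proof can be observed on a small example. The sketch below assumes that the matrix is already upper triangular (so $Q = I$) and uses $\delta = \min(1, \varepsilon / ((n-1) b))$ as one choice that makes the estimate work; the function names are illustrative.

```python
def row_sum_norm(c):
    """Maximum absolute row sum, i.e. the matrix norm (3.6)."""
    return max(sum(abs(v) for v in row) for row in c)


def scaled(b, delta):
    """Entries of C = D^{-1} B D with D = diag(1, delta, ..., delta^(n-1))."""
    n = len(b)
    return [[delta ** (k - j) * b[j][k] for k in range(n)] for j in range(n)]


B = [[1.0, 10.0], [0.0, 2.0]]            # upper triangular, eigenvalues 1 and 2
n = len(B)
rho = max(abs(B[j][j]) for j in range(n))
eps = 1.0
b = max(abs(B[j][k]) for j in range(n) for k in range(j + 1, n))
delta = 1.0 if b == 0 else min(1.0, eps / ((n - 1) * b))
C = scaled(B, delta)

print(row_sum_norm(B))   # large, dominated by the off-diagonal entry
print(row_sum_norm(C))   # at most rho + eps
```

The off-diagonal entry is damped by the factor $\delta$, while the diagonal, and hence the spectrum, is unchanged.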


3.5 Completeness 


Definition 3.33 A sequence $(x_n)$ of elements in a normed space $X$ is
called a Cauchy sequence if for every $\varepsilon > 0$ there exists an integer $N(\varepsilon)$
such that
\[ \|x_n - x_m\| < \varepsilon \]
for all $n, m \ge N(\varepsilon)$, i.e., if $\lim_{n,m\to\infty} \|x_n - x_m\| = 0$.

Theorem 3.34 Every convergent sequence is a Cauchy sequence.


Proof. Let $x_n \to x$, $n \to \infty$. Then, for $\varepsilon > 0$ there exists $N(\varepsilon) \in \mathbb{N}$ such
that $\|x_n - x\| < \varepsilon/2$ for all $n \ge N(\varepsilon)$. Now the triangle inequality yields
\[ \|x_n - x_m\| = \|x_n - x + x - x_m\| \le \|x_n - x\| + \|x - x_m\| < \varepsilon \]
for all $n, m \ge N(\varepsilon)$. □


The fact that the converse of Theorem 3.34 is not true in general gives 
rise to the following definition. 


Definition 3.35 A subset $U$ of a normed space $X$ is called complete if
every Cauchy sequence of elements in $U$ converges to an element in $U$. A
normed space is called a Banach space if it is complete. A pre-Hilbert space
is called a Hilbert space if it is complete.


The subset of rational numbers is not complete in $\mathbb{R}$. In order to give
further examples, we introduce some infinite-dimensional normed spaces.

The set $C[a,b]$ of continuous functions $f : [a,b] \to \mathbb{R}$ equipped with
pointwise addition and scalar multiplication,
\[ (f + g)(x) := f(x) + g(x), \qquad (\alpha f)(x) := \alpha f(x), \]
obviously is a linear space. Since the monomials $x \mapsto x^n$, $n = 0, 1, \dots$, are
linearly independent (see Theorem 8.2), $C[a,b]$ has infinite dimension.

Example 3.36 The linear space $C[a,b]$ furnished with the maximum norm
\[ \|f\|_\infty := \max_{x \in [a,b]} |f(x)| \]
is a Banach space.


Proof. The norm axioms (N1)-(N3) are trivially satisfied. The triangle
inequality follows from
\[ \|f + g\|_\infty = \max_{x \in [a,b]} |(f + g)(x)| = |(f + g)(x_0)| \le |f(x_0)| + |g(x_0)|
   \le \max_{x \in [a,b]} |f(x)| + \max_{x \in [a,b]} |g(x)| = \|f\|_\infty + \|g\|_\infty \]
for some $x_0 \in [a,b]$. Since the condition $\|f_n - f\|_\infty < \varepsilon$ is equivalent to
$|f_n(x) - f(x)| < \varepsilon$ for all $x \in [a,b]$, convergence of a sequence of continuous
functions in the maximum norm is equivalent to uniform convergence on
$[a,b]$. Since the Cauchy criterion is sufficient for uniform convergence of a
sequence of continuous functions to a continuous limit function, the space
$C[a,b]$ is complete with respect to the maximum norm. □


Example 3.37 The linear space $C[a,b]$ equipped with the $L_1$ norm
\[ \|f\|_1 := \int_a^b |f(x)| \, dx \]
is not complete.


Proof. The norm axioms are trivially satisfied. Without loss of generality
we take $[a,b] = [0,2]$ and choose
\[ f_n(x) := \begin{cases} x^n, & 0 \le x \le 1, \\ 1, & 1 \le x \le 2. \end{cases} \]
Then for $m > n$ we have that
\[ \|f_n - f_m\|_1 = \int_0^1 (x^n - x^m) \, dx \le \frac{1}{n+1} \to 0, \quad n \to \infty, \]
and therefore $(f_n)$ is a Cauchy sequence. Now we assume that $(f_n)$
converges with respect to the $L_1$ norm to a continuous function $f$; i.e.,
\[ \|f_n - f\|_1 \to 0, \quad n \to \infty. \]
Then
\[ \int_0^1 |f(x)| \, dx \le \int_0^1 |f(x) - x^n| \, dx + \int_0^1 x^n \, dx \le \|f - f_n\|_1 + \frac{1}{n+1} \to 0 \]
for $n \to \infty$, whence $f(x) = 0$ follows for $0 \le x \le 1$. Furthermore, we have
\[ \int_1^2 |f(x) - 1| \, dx = \int_1^2 |f(x) - f_n(x)| \, dx \le \|f - f_n\|_1 \to 0, \quad n \to \infty. \]
This implies that $f(x) = 1$ for $1 \le x \le 2$, and we have a contradiction,
since $f$ is continuous.

However, we note that the space $L^1[a,b]$ of measurable and Lebesgue
integrable real-valued functions is complete with respect to the $L_1$ norm
(see [5, 51, 59]). □


Example 3.38 The linear space $C[a,b]$ equipped with the $L_2$ norm
\[ \|f\|_2 := \Bigl( \int_a^b |f(x)|^2 \, dx \Bigr)^{1/2} \]
is not complete.


Proof. The norm is generated by the scalar product
\[ (f, g) := \int_a^b f(x) g(x) \, dx. \]
Considering the same sequence as in Example 3.37, it can be seen that
$C[a,b]$ also is not complete with respect to the $L_2$ norm. Again note that
the space $L^2[a,b]$ of measurable and Lebesgue square-integrable real-valued
functions is complete with respect to the $L_2$ norm (see [5, 51, 59]). □


Theorem 3.39 Each finite-dimensional normed space is a Banach space.


Proof. Let $X$ be finite-dimensional with basis $u_1, \dots, u_n$ and assume that
$(x_\nu)$ is a Cauchy sequence in $X$. We represent
\[ x_\nu = \sum_{j=1}^{n} \alpha_{j\nu} u_j \]
and recall from Theorem 3.8 that there exists $C > 0$ such that
\[ \max_{j=1,\dots,n} |\alpha_{j\nu} - \alpha_{j\mu}| \le C \|x_\nu - x_\mu\| \]
for all $\nu, \mu \in \mathbb{N}$. Hence for $j = 1, \dots, n$ the $(\alpha_{j\nu})$ are Cauchy sequences
in $\mathbb{C}$. Therefore, there exist $\alpha_1, \dots, \alpha_n$ such that $\alpha_{j\nu} \to \alpha_j$, $\nu \to \infty$, for
$j = 1, \dots, n$, since the Cauchy criterion is sufficient for convergence in $\mathbb{C}$.
Then we have convergence,
\[ x_\nu \to x := \sum_{j=1}^{n} \alpha_j u_j \in X, \quad \nu \to \infty, \]
and the proof is finished. □

Remark 3.40 Complete sets are closed, and each closed subset of a
complete subset is complete.


Proof. This is trivial. □


3.6 The Banach Fixed Point Theorem 


Definition 3.41 Let $U$ be a subset of a normed space $X$. An operator
$A : U \to X$ is called a contraction operator if there exists a constant
$q \in [0,1)$ such that
\[ \|Ax - Ay\| \le q \|x - y\| \]
for all $x, y \in U$. Each constant $q$ satisfying this inequality is called a
contraction number of the operator $A$.


Frequently, we will call a contraction operator simply a contraction.

Remark 3.42 Each contraction operator is continuous.


Proof. This is trivial, since the convergence $\|x_n - x\| \to 0$, $n \to \infty$, implies
that $\|Ax_n - Ax\| \le q \|x_n - x\| \to 0$, $n \to \infty$. □


An operator $A : U \to X$ is called Lipschitz continuous with Lipschitz
constant $L$ if there exists a positive constant $L$ such that
\[ \|Ax - Ay\| \le L \|x - y\| \]
for all $x, y \in U$. Thus, contraction operators are Lipschitz continuous
operators with Lipschitz constant less than one.


Definition 3.43 An element $x$ of a normed space $X$ is called a fixed point
of an operator $A : U \subset X \to X$ if
\[ Ax = x. \]

Theorem 3.44 Each contraction operator has at most one fixed point.


Proof. Assume that $x$ and $y$ are two different fixed points of the contraction
operator $A$. Then
\[ 0 \ne \|x - y\| = \|Ax - Ay\| \le q \|x - y\|, \]
whence $1 \le q$ follows. This is a contradiction to the fact that $A$ is a
contraction operator. □


Theorem 3.45 (Banach) Let $U$ be a complete subset of a normed space
$X$ and let $A : U \to U$ be a contraction operator. Then $A$ has a unique fixed
point.

Proof. Starting from an arbitrary element $x_0 \in U$ we define a sequence $(x_n)$
in $U$ by the recursion
\[ x_{n+1} := Ax_n, \quad n = 0, 1, 2, \dots. \]
Then we have
\[ \|x_{n+1} - x_n\| = \|Ax_n - Ax_{n-1}\| \le q \|x_n - x_{n-1}\|, \]
and from this we deduce by induction that
\[ \|x_{n+1} - x_n\| \le q^n \|x_1 - x_0\|, \quad n = 1, 2, \dots. \]
Hence, for $m > n$, by the triangle inequality and the geometric series it
follows that
\[ \|x_n - x_m\| \le \|x_n - x_{n+1}\| + \|x_{n+1} - x_{n+2}\| + \cdots + \|x_{m-1} - x_m\|
   \le (q^n + q^{n+1} + \cdots + q^{m-1}) \|x_1 - x_0\| \le \frac{q^n}{1-q} \|x_1 - x_0\|. \tag{3.12} \]
Since $q^n \to 0$, $n \to \infty$, this implies that $(x_n)$ is a Cauchy sequence, and
therefore because $U$ is complete there exists an element $x \in U$ such that
$x_n \to x$, $n \to \infty$. Finally, the continuity of $A$ from Remark 3.42 yields
\[ x = \lim_{n\to\infty} x_{n+1} = \lim_{n\to\infty} Ax_n = Ax; \]
i.e., $x$ is a fixed point of $A$. That this fixed point is unique we have already
settled by Theorem 3.44. □


The main importance of Banach’s fixed point theorem in numerical anal- 
ysis originates from its constructive proof. Besides establishing existence of 
a fixed point by the method of successive approximations, it also provides 
an algorithm for obtaining numerical approximations. And this algorithm 
is very easy to program because of its iterative nature. We explicitly state 
this in the following theorem. 


Theorem 3.46 Let $A$ be a contraction operator with contraction constant
$q$ mapping a complete subset $U$ of a normed space $X$ into itself. Then the
successive approximations
\[ x_{n+1} := Ax_n, \quad n = 0, 1, 2, \dots, \]
with arbitrary $x_0 \in U$ converge to the unique fixed point $x$ of $A$. We have
the a priori error estimate
\[ \|x_n - x\| \le \frac{q^n}{1-q} \|x_1 - x_0\| \]
and the a posteriori error estimate
\[ \|x_n - x\| \le \frac{q}{1-q} \|x_n - x_{n-1}\| \]
for all $n \in \mathbb{N}$.


Proof. The a priori error estimate follows from (3.12) by passing to the
limit $m \to \infty$. The a posteriori estimate follows from the a priori estimate
applied with starting element $x_0 = x_{n-1}$. □


The a priori estimate is used in order to obtain upper bounds on the
number of iteration steps that are necessary to achieve a desired accuracy.
In order to guarantee that
\[ \|x_n - x\| \le \varepsilon \]
for a given accuracy $\varepsilon$, by the a priori estimate we need
\[ n \ge \frac{\ln \tilde\varepsilon}{\ln q} \]
iterations, where $\tilde\varepsilon = (1-q)\,\varepsilon / \|x_1 - x_0\|$. The smaller the contraction
constant $q$, the fewer iteration steps are required. The a posteriori estimate,
which in general yields better estimates as compared with the a priori
estimate, is used to check the accuracy during the computation and terminate
the iterations when the required accuracy is reached.
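Both error estimates are easy to use in a program. The following sketch assumes a real-valued contraction on an interval; the example $A = \cos$ on $[0,1]$ with contraction number $q = \sin 1$ (a bound for $|\cos'|$ on that interval) is an illustration chosen here, not taken from the text, and the function names are hypothetical.

```python
import math


def fixed_point(A, x0, q, eps, max_iter=10_000):
    """Successive approximations x_{n+1} = A(x_n), stopped as soon as the
    a posteriori bound q / (1 - q) * |x_n - x_{n-1}| is at most eps."""
    x_prev, x = x0, A(x0)
    n = 1
    while q / (1.0 - q) * abs(x - x_prev) > eps and n < max_iter:
        x_prev, x = x, A(x)
        n += 1
    return x, n


def a_priori_steps(q, eps, first_step):
    """Smallest n with q^n / (1 - q) * |x_1 - x_0| <= eps (assumes x_1 != x_0)."""
    eta = (1.0 - q) * eps / first_step
    return max(0, math.ceil(math.log(eta) / math.log(q)))


q = math.sin(1.0)                        # contraction number of cos on [0, 1]
x, n = fixed_point(math.cos, 1.0, q, 1e-8)
n_bound = a_priori_steps(q, 1e-8, abs(math.cos(1.0) - 1.0))

print(x, n, n_bound)   # the a posteriori test stops well before the a priori bound
```

In this example the a posteriori criterion terminates after far fewer steps than the a priori bound predicts, which is the typical situation described above.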
The property
\[ \|Ax - Ay\| < \|x - y\| \]
for all $x, y$ with $x \ne y$, which is weaker than the contraction property, is not
sufficient in general to ensure the existence of a fixed point, as illustrated
in the following example (see also Problem 3.18).


Example 3.47 The function $f : [0,\infty) \to [0,\infty)$ given by
\[ f(x) := x + \frac{1}{1+x} \]
as a consequence of
\[ f(x) - f(y) = \frac{x + y + xy}{(1+x)(1+y)} \, (x - y) \]
fulfills the condition
\[ |f(x) - f(y)| < |x - y| \]
for $x \ne y$. However, because of
\[ f(x) - x = \frac{1}{1+x} > 0 \]
for all $x \ge 0$, it does not have a fixed point. □

We conclude this section by considering the special case of linear
operators, i.e., by considering the Neumann series (see Problem 3.16).

Let $A : X \to Y$ be an operator mapping a set $X$ into a set $Y$. If for
each $y \in A(X)$ there is only one element $x \in X$ with $Ax = y$, then $A$ is
said to be injective and to have an inverse $A^{-1} : A(X) \to X$ defined by
$A^{-1} y := x$. The inverse mapping satisfies $A^{-1} A = I$ on $X$ and $A A^{-1} = I$
on $A(X)$, where $I$ denotes the identity operator mapping each element into
itself. If $A(X) = Y$, then the mapping is said to be surjective. The mapping
is called bijective if it is injective and surjective, i.e., if the inverse mapping
$A^{-1} : Y \to X$ exists.


Theorem 3.48 Let $B : X \to X$ be a bounded linear operator on a Banach
space $X$ with $\|B\| < 1$, and let $I : X \to X$ denote the identity operator.
Then $I - B$ is bijective; i.e., for each $z \in X$ the equation
\[ x - Bx = z \]
has a unique solution $x \in X$. The successive approximations
\[ x_{n+1} := Bx_n + z, \quad n = 0, 1, 2, \dots, \]
with arbitrary $x_0 \in X$ converge to this solution, and we have the a priori
estimate
\[ \|x_n - x\| \le \frac{\|B\|^n}{1 - \|B\|} \, \|x_1 - x_0\| \]
and the a posteriori estimate
\[ \|x_n - x\| \le \frac{\|B\|}{1 - \|B\|} \, \|x_n - x_{n-1}\| \]
for all $n \in \mathbb{N}$. Furthermore, the inverse operator $(I - B)^{-1}$ is bounded by
\[ \|(I - B)^{-1}\| \le \frac{1}{1 - \|B\|}. \]

Proof. For fixed, but arbitrary, $z \in X$ we define the operator $A : X \to X$
by
\[ Ax := Bx + z, \quad x \in X. \]
Then we have
\[ \|Ax - Ay\| = \|B(x - y)\| \le \|B\| \, \|x - y\| \]
for all $x, y \in X$; i.e., $A$ is a contraction with contraction number $q = \|B\|$.
Now the statements of the theorem can be deduced from Theorem 3.46.
With the starting element $x_0 = z$ the successive approximations lead to
\[ x_n = \sum_{k=0}^{n} B^k z \]
with the iterated operators $B^k : X \to X$ defined recursively by $B^0 := I$
and $B^k := B B^{k-1}$ for $k \in \mathbb{N}$. Hence, in view of Remark 3.25, we have
\[ \|x_n\| \le \sum_{k=0}^{n} \|B^k z\| \le \sum_{k=0}^{n} \|B\|^k \|z\| \le \frac{\|z\|}{1 - \|B\|}, \]
and therefore, since $x_n \to (I - B)^{-1} z$, $n \to \infty$, it follows that
\[ \|(I - B)^{-1} z\| \le \frac{\|z\|}{1 - \|B\|} \]
for all $z \in X$. □
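For a matrix $B$ with $\|B\| < 1$ the successive approximations of Theorem 3.48 amount to summing the Neumann series term by term. The sketch below assumes a small real matrix stored as a list of rows and measures sizes in the maximum norm; the example data and the function names are illustrative.

```python
def matvec(b, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(bij * xj for bij, xj in zip(row, x)) for row in b]


def neumann_solve(b, z, steps):
    """Iterate x_{n+1} = B x_n + z starting from x_0 = z, so that
    x_n is the partial sum z + B z + ... + B^n z."""
    x = list(z)
    for _ in range(steps):
        x = [bx + zi for bx, zi in zip(matvec(b, x), z)]
    return x


B = [[0.5, 0.25], [0.25, 0.5]]           # row-sum norm 0.75 < 1
z = [1.0, 1.0]
x = neumann_solve(B, z, 200)

# Residual of x - B x = z and the bound ||z|| / (1 - ||B||) in the maximum norm.
residual = [xi - bx - zi for xi, bx, zi in zip(x, matvec(B, x), z)]
bound = max(abs(v) for v in z) / (1.0 - 0.75)
print(x, residual, bound)
```

Here the exact solution is $x = (4, 4)$, so the bound $\|(I - B)^{-1} z\|_\infty \le \|z\|_\infty / (1 - \|B\|_\infty) = 4$ is attained.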


3.7 Best Approximation 


Definition 3.49 Let $U \subset X$ be a subset of a normed space $X$ and let
$w \in X$. An element $v \in U$ is called a best approximation to $w$ with respect
to $U$ if
\[ \|w - v\| = \inf_{u \in U} \|w - u\|, \]
i.e., if $v \in U$ has smallest distance from $w$.


Theorem 3.50 Let $U$ be a finite-dimensional subspace of a normed space
$X$. Then for every element in $X$ there exists a best approximation with
respect to $U$.


Proof. Let $w \in X$ and choose a minimizing sequence $(u_n)$ for $w$; i.e., $u_n \in U$
satisfies
\[ \|w - u_n\| \to d := \inf_{u \in U} \|w - u\|, \quad n \to \infty. \]
Because of $\|u_n\| \le \|w - u_n\| + \|w\|$ the sequence $(u_n)$ is bounded. By
Theorem 3.11 the sequence $(u_n)$ contains a convergent subsequence $(u_{n(\ell)})$
with limit $v \in U$. Then
\[ \|w - v\| = \lim_{\ell\to\infty} \|w - u_{n(\ell)}\| = d \]
completes the proof. □


Theorem 3.51 Let $U$ be a linear subspace of a pre-Hilbert space $X$. An
element $v$ is a best approximation to $w \in X$ with respect to $U$ if and only
if
\[ (w - v, u) = 0 \tag{3.13} \]
for all $u \in U$, i.e., if and only if $w - v \perp U$. To each $w \in X$ there exists at
most one best approximation with respect to $U$.

Proof. We begin by noting the equality
\[ \|w - u\|^2 = \|w - v\|^2 + 2 \operatorname{Re}(w - v, v - u) + \|v - u\|^2, \tag{3.14} \]
which is valid for all $u, v \in U$. From this, sufficiency of the condition (3.13)
is obvious, since $U$ is a linear subspace.

To establish the necessity we assume that $v$ is a best approximation and
$(w - v, u_0) \ne 0$ for some $u_0 \in U$. Then, since $U$ is a linear subspace, we
may assume that $(w - v, u_0) \in \mathbb{R}$. Choosing
\[ u = v + \frac{(w - v, u_0)}{\|u_0\|^2} \, u_0, \]
from (3.14) we arrive at
\[ \|w - u\|^2 = \|w - v\|^2 - \frac{(w - v, u_0)^2}{\|u_0\|^2} < \|w - v\|^2, \]
which contradicts the fact that $v$ is a best approximation of $w$.

Finally, assume that $v_1$ and $v_2$ are best approximations. Then from (3.13)
it follows that $(w - v_1, v_1 - v_2) = 0 = (w - v_2, v_1 - v_2)$. This implies
$(v_1 - v_2, v_1 - v_2) = 0$, whence $v_1 = v_2$ follows. □


Theorem 3.52 Let $U$ be a complete linear subspace of a pre-Hilbert space
$X$. Then to each element $w \in X$ there exists a unique best approximation
with respect to $U$. The operator $P : X \to U$ mapping $w \in X$ onto its best
approximation is a bounded linear operator with the properties
\[ P^2 = P \quad \text{and} \quad \|P\| = 1. \]
It is called the orthogonal projection from $X$ onto $U$.


Proof. Choose a sequence $(u_n)$ with
\[ \|w - u_n\|^2 \le d^2 + \frac{1}{n}, \quad n \in \mathbb{N}, \tag{3.15} \]
where $d := \inf_{u \in U} \|w - u\|$. Then
\[ \|(w - u_n) + (w - u_m)\|^2 + \|u_n - u_m\|^2 = 2 \|w - u_n\|^2 + 2 \|w - u_m\|^2
   \le 4 d^2 + \frac{2}{n} + \frac{2}{m} \]
for all $n, m \in \mathbb{N}$, and since $\frac{1}{2}(u_n + u_m) \in U$, it follows that
\[ \|u_n - u_m\|^2 \le 4 d^2 + \frac{2}{n} + \frac{2}{m} - 4 \Bigl\| w - \frac{1}{2}(u_n + u_m) \Bigr\|^2
   \le \frac{2}{n} + \frac{2}{m}. \]
Hence, $(u_n)$ is a Cauchy sequence, and since $U$ is complete, there exists an
element $v \in U$ such that $u_n \to v$, $n \to \infty$. Passing to the limit $n \to \infty$
in (3.15) shows that $v$ is a best approximation of $w$ with respect to $U$.
Uniqueness of the best approximation follows from Theorem 3.51.

Trivially, we have $Pu = u$ for all $u \in U$, and this implies $P^2 = P$. From
(3.13) it can be deduced that $P$ is a linear operator and that
\[ \|w\|^2 = \|Pw\|^2 + \|w - Pw\|^2 \ge \|Pw\|^2 \]
for all $w \in X$. Therefore, $P$ is bounded with $\|P\| \le 1$. From Remark 3.25
and $P^2 = P$ it follows that $\|P\| \ge 1$, which concludes the proof. □


Corollary 3.53 Let $U$ be a finite-dimensional linear subspace of a
pre-Hilbert space $X$ with basis $u_1, \dots, u_n$. The linear combination
\[ v = \sum_{k=1}^{n} \alpha_k u_k \]
is the best approximation to $w \in X$ with respect to $U$ if and only if the
coefficients $\alpha_1, \dots, \alpha_n$ satisfy the normal equations
\[ \sum_{k=1}^{n} \alpha_k (u_k, u_j) = (w, u_j), \quad j = 1, \dots, n. \tag{3.16} \]


Proof. The normal equations (3.16) obviously are equivalent to (3.13). □


The normal equations for the best approximation in pre-Hilbert spaces
provide further examples of systems of linear equations. The solution
becomes trivial if the basis $u_1, \dots, u_n$ is orthonormal.


Corollary 3.54 Let $U$ be a finite-dimensional linear subspace of a
pre-Hilbert space $X$ with orthonormal basis $u_1, \dots, u_n$. Then the orthogonal
projection operator is given by
\[ Pw = \sum_{k=1}^{n} (w, u_k) \, u_k, \quad w \in X. \]


Proof. This is trivial from either the orthogonality condition of Theorem
3.51 or the normal equations of Corollary 3.53. □
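As a small illustration of the normal equations, the following sketch computes the best approximation to a vector in $\mathbb{R}^3$ from the span of two basis vectors that are not orthogonal. It assumes the Euclidean scalar product and solves the $2 \times 2$ system by Cramer's rule; the data and the function names are illustrative.

```python
def dot(x, y):
    """Euclidean scalar product of two real vectors given as lists."""
    return sum(a * b for a, b in zip(x, y))


def best_approximation(u1, u2, w):
    """Solve the normal equations
        a1 (u1,u1) + a2 (u2,u1) = (w,u1),
        a1 (u1,u2) + a2 (u2,u2) = (w,u2)
    and return (a1, a2, v) with v = a1 u1 + a2 u2."""
    g11, g12, g22 = dot(u1, u1), dot(u1, u2), dot(u2, u2)
    r1, r2 = dot(w, u1), dot(w, u2)
    det = g11 * g22 - g12 * g12          # nonzero for linearly independent u1, u2
    a1 = (r1 * g22 - g12 * r2) / det
    a2 = (g11 * r2 - g12 * r1) / det
    v = [a1 * p + a2 * q for p, q in zip(u1, u2)]
    return a1, a2, v


u1, u2, w = [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [2.0, 3.0, 5.0]
a1, a2, v = best_approximation(u1, u2, w)
r = [wi - vi for wi, vi in zip(w, v)]    # the error w - v

print(a1, a2, v)
print(dot(r, u1), dot(r, u2))            # both vanish: w - v is orthogonal to U
```

The vanishing scalar products in the last line are exactly the orthogonality condition (3.13) of Theorem 3.51.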


Problems 


3.1 Show that (3.1) defines a norm on $\mathbb{C}^n$ for $p \ge 1$ and that
\[ \lim_{p\to\infty} \|x\|_p = \|x\|_\infty \]
for all $x \in \mathbb{C}^n$.

3.2 Indicate the closed balls $\{x \in \mathbb{R}^2 : \|x\|_p \le 1\}$ for $p = 1, 2, \infty$. What
properties do they have in common?


3.3 Show that (3.1) does not define a norm on $\mathbb{C}^n$ for $0 < p < 1$.


3.4 For the $\ell_1$ and $\ell_\infty$ norms on $\mathbb{C}^n$ show that $\|x\|_\infty \le \|x\|_1 \le n \|x\|_\infty$.


3.5 Let $X$ and $Y$ be normed spaces with norms $\|\cdot\|_X$ and $\|\cdot\|_Y$, respectively.
Show that
\[ \|(x, y)\| := \|x\|_X + \|y\|_Y, \]
\[ \|(x, y)\| := \bigl( \|x\|_X^2 + \|y\|_Y^2 \bigr)^{1/2}, \]
\[ \|(x, y)\| := \max(\|x\|_X, \|y\|_Y), \]
for $(x, y) \in X \times Y$ define norms on the product $X \times Y$.


3.6 Show that convergent sequences are bounded. 

3.7 Let $(x_n)$ be a sequence of elements of a normed space $X$. The series
\[ \sum_{k=1}^{\infty} x_k \]
is called convergent if the sequence $(S_n)$ of partial sums
\[ S_n := \sum_{k=1}^{n} x_k \]
converges. The limit $S = \lim_{n\to\infty} S_n$ is called the sum of the series. Show that
in a Banach space $X$ the convergence of the series
\[ \sum_{k=1}^{\infty} \|x_k\| \]
is a sufficient condition for the convergence of the series $\sum_{k=1}^{\infty} x_k$ and that
\[ \Bigl\| \sum_{k=1}^{\infty} x_k \Bigr\| \le \sum_{k=1}^{\infty} \|x_k\|. \]


3.8 A norm $\|\cdot\|_a$ on a linear space $X$ is called stronger than a norm $\|\cdot\|_b$ if every
sequence converging with respect to the norm $\|\cdot\|_a$ also converges with respect
to the norm $\|\cdot\|_b$. Show that $\|\cdot\|_a$ is stronger than $\|\cdot\|_b$ if and only if there
exists a positive number $C$ such that $\|x\|_b \le C \|x\|_a$ for all $x \in X$. Show that on
$C[a,b]$ the maximum norm is stronger than the $L_2$ norm (and stronger than the
$L_1$ norm). Construct a counterexample to demonstrate that the maximum norm
and the $L_2$ norm (and the maximum norm and the $L_1$ norm) are not equivalent.


3.9 Show that in a normed space the operations of addition and multiplication
by a scalar are continuous functions. Show that in a pre-Hilbert space the scalar
product is a continuous function.

3.10 Show that a norm $\|\cdot\|$ on a linear space $X$ is generated by a scalar product
if and only if the parallelogram equality
\[ \|x + y\|^2 + \|x - y\|^2 = 2 \bigl( \|x\|^2 + \|y\|^2 \bigr) \]
holds for all $x, y \in X$. Show that the $\ell_1$ and $\ell_\infty$ norms on $\mathbb{C}^n$ are not generated
by scalar products.


3.11 Let $A$ be a positive definite $n \times n$ matrix and denote by $(\cdot\,,\cdot)$ the Euclidean
scalar product on $\mathbb{C}^n$. Show that $(Ax, y)$ defines a scalar product on $\mathbb{C}^n$.


3.12 Show that eigenvectors of a matrix for different eigenvalues are linearly 
independent. 


3.13 Let $X$ and $Y$ be normed spaces and denote by $L(X, Y)$ the linear space
of all bounded linear operators $A : X \to Y$. Show that $L(X, Y)$ equipped with
\[ \|A\| := \sup_{\|x\| = 1} \|Ax\| \]
again is a normed space and that $L(X, Y)$ is a Banach space if $Y$ is a Banach
space.


3.14 Let $A : X \to X$ denote an operator from a normed space $X$ into itself.
The iterated operators $A^n : X \to X$, $n = 0, 1, \dots$, are defined recursively by
$A^0 := I$ and $A^n := A A^{n-1}$ for $n \in \mathbb{N}$. If $A$ is bounded and linear, show that
$\|A^n\| \le \|A\|^n$.


3.15 Show that for $n \times n$ matrices $A$ the series
\[ \sum_{k=0}^{\infty} \frac{A^k}{k!} \]
converges (with respect to any norm on $\mathbb{C}^n$), and denote the sum of the series by
$e^A$. Show that if $\lambda$ is an eigenvalue of $A$, then $e^\lambda$ is an eigenvalue of $e^A$.


3.16 Show that if $B : X \to X$ is a linear operator on a Banach space $X$ with
$\|B\| < 1$, then the Neumann series
\[ \sum_{k=0}^{\infty} B^k = (I - B)^{-1} \]
converges in the Banach space $L(X, X)$.


3.17 Let $U$ be a complete subset of a normed space $X$ and let $A : U \to U$ be
a continuous operator, and assume that $A^m$ is a contraction for some $m \in \mathbb{N}$.
Show that $A$ has a unique fixed point and that the successive approximations
$x_{n+1} := Ax_n$, $n = 0, 1, \dots$, with arbitrary $x_0 \in U$ converge to this fixed point.


3.18 A subset $U$ of a normed space $X$ is called sequentially compact if each
sequence from $U$ contains a convergent subsequence with limit in $U$. Let $U$ be a
complete and sequentially compact subset of a normed space $X$ and let $A : U \to U$
be an operator with the property
\[ \|Ax - Ay\| < \|x - y\| \]
for all $x, y \in U$ with $x \ne y$. Show that $A$ has a unique fixed point and that the
successive approximations $x_{n+1} := Ax_n$, $n = 0, 1, \dots$, with arbitrary $x_0 \in U$
converge to this fixed point.


3.19 Let $\{u_n : n \in \mathbb{N}\}$ be an orthonormal system in a pre-Hilbert space $X$.
Show that the following properties are equivalent:

(a) $\operatorname{span}\{u_n : n \in \mathbb{N}\}$ is dense in $X$.

(b) Each $\varphi \in X$ can be expanded in a Fourier series

$$\varphi = \sum_{n=1}^{\infty} (\varphi, u_n)\, u_n.$$

(c) For each $\varphi \in X$ we have Parseval's equality

$$\|\varphi\|^2 = \sum_{n=1}^{\infty} |(\varphi, u_n)|^2.$$

Show that properties (a)–(c) imply that

(d) $\varphi = 0$ is the only element in $X$ with $(\varphi, u_n) = 0$ for all $n \in \mathbb{N}$,

and that (a), (b), (c), and (d) are equivalent if $X$ is a Hilbert space.


3.20 Show that the best approximation to a function $f \in C[0, 2\pi]$ in the $L^2$
norm with respect to the space of trigonometric polynomials of degree at most $n$
is given by the partial sum

$$(P_n f)(x) = \frac{a_0}{2} + \sum_{k=1}^{n} a_k \cos kx + \sum_{k=1}^{n} b_k \sin kx, \quad x \in [0, 2\pi],$$

of the Fourier series of $f$ with the Fourier coefficients

$$a_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \cos kx \, dx, \qquad b_k = \frac{1}{\pi} \int_0^{2\pi} f(x) \sin kx \, dx.$$


4 


Iterative Methods for Linear Systems 


This chapter is devoted to applying the analysis developed in the previous 
chapter to the iterative solution of systems of linear equations. In particular, 
we will discuss in detail the Jacobi and the Gauss-Seidel iterations, which 
essentially go back to Gauss. In Supplementum Theoriae Combinationis
Observationum Erroribus Minimis Obnoxiae, published in 1822, Gauss used
a variant of the Gauss-Seidel method for the solution of the linear systems 
arising through his least squares method, since they were too large for 
elimination methods. 

With the advent of computers the size of the linear systems that could
be solved grew enormously, which created a need to speed up the
convergence of the classical Jacobi and Gauss-Seidel iterations. In this
context, we will introduce the reader to the idea of relaxation methods,
including a typical example that illustrates the dramatic gain in the speed
of convergence by overrelaxation. We will conclude the chapter with the idea
of defect correction iteration and indicate its application to the very efficient
solution of the large linear systems arising from the discretization of linear
differential and integral equations by two-grid and multigrid methods.


4.1 Jacobi and Gauss-Seidel Iterations 

We start by supplementing the sufficient condition of Theorem 3.48 for
convergence of the method of successive approximations by establishing a
necessary and sufficient condition for the finite-dimensional case.
Theorem 4.1 Let $B$ be an $n \times n$ matrix. Then the successive approximations

$$x_{\nu+1} := B x_\nu + z, \quad \nu = 0, 1, 2, \ldots,$$

converge for each $z \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ if and only if

$$\rho(B) < 1$$

for the spectral radius of $B$.


Proof. If $\rho(B) < 1$, then by Theorem 3.32 there exists a norm $\|\cdot\|$ on $\mathbb{C}^n$
such that $\|B\| < 1$. Now convergence follows from Theorem 3.48 together
with the equivalence of all norms on $\mathbb{C}^n$ according to Theorem 3.8.
Conversely, suppose that convergence holds. If we assume that $\rho(B) \ge 1$,
then there exists an eigenvalue $\lambda$ of $B$ with $|\lambda| \ge 1$. Let $x$ denote an
associated eigenvector. Then the successive iterations for the right-hand side
$z = x$ and the starting element $x_0 = x$ lead to the divergent sequence
$x_\nu = \left( \sum_{k=0}^{\nu} \lambda^k \right) x$. This is a contradiction. □


We note that Theorem 4.1 remains valid for bounded linear operators
$B : X \to X$ in infinite-dimensional Banach spaces with the definition of
the spectral radius appropriately modified. However, the proof requires a
different and deeper analysis.

For the iterative solution of a system of linear equations of the form

$$Ax = y$$

we distinguish different methods by the way in which the original system
is transformed into an equivalent fixed-point form. We decompose $A$ by

$$A = D + A_L + A_R$$

into a diagonal matrix

$$D = \operatorname{diag}(a_{11}, \ldots, a_{nn}),$$

a proper lower (left) triangular matrix

$$A_L = \begin{pmatrix}
0 & & & & \\
a_{21} & 0 & & & \\
a_{31} & a_{32} & 0 & & \\
\vdots & \vdots & \ddots & \ddots & \\
a_{n1} & a_{n2} & \cdots & a_{n,n-1} & 0
\end{pmatrix},$$

and a proper upper (right) triangular matrix

$$A_R = \begin{pmatrix}
0 & a_{12} & a_{13} & \cdots & a_{1n} \\
 & 0 & a_{23} & \cdots & a_{2n} \\
 & & \ddots & \ddots & \vdots \\
 & & & 0 & a_{n-1,n} \\
 & & & & 0
\end{pmatrix}.$$

We assume that all the diagonal entries of $A$ are different from zero. Hence
the inverse $D^{-1}$ of $D$ exists.

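In the method attributed to Jacobi, which is sometimes also called the
method of simultaneous displacements, the system $Ax = y$ is transformed
into the equivalent form

$$x = -D^{-1}(A_L + A_R)\,x + D^{-1} y,$$

and the latter is solved by successive approximations

$$x_{\nu+1} := -D^{-1}(A_L + A_R)\,x_\nu + D^{-1} y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. Written in components, one
step of the Jacobi iteration scheme reads

$$x_{\nu+1,j} = -\sum_{\substack{k=1 \\ k \ne j}}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$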

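Theorem 4.2 Assume that the matrix $A = (a_{jk})$ satisfies

$$q_\infty := \max_{j=1,\ldots,n} \sum_{\substack{k=1 \\ k \ne j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| < 1 \tag{4.1}$$

or

$$q_1 := \max_{k=1,\ldots,n} \sum_{\substack{j=1 \\ j \ne k}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| < 1 \tag{4.2}$$

or

$$q_2 := \left( \sum_{\substack{j,k=1 \\ j \ne k}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right|^2 \right)^{1/2} < 1. \tag{4.3}$$

Then the Jacobi method, or method of simultaneous displacements,

$$x_{\nu+1,j} = -\sum_{\substack{k=1 \\ k \ne j}}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$). For $\mu = 1, 2, \infty$, if $q_\mu < 1$, we have the a
priori error estimate

$$\|x_\nu - x\|_\mu \le \frac{q_\mu^\nu}{1 - q_\mu}\, \|x_1 - x_0\|_\mu$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\mu \le \frac{q_\mu}{1 - q_\mu}\, \|x_\nu - x_{\nu-1}\|_\mu$$

for all $\nu \in \mathbb{N}$.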


Proof. The Jacobi matrix $-D^{-1}(A_L + A_R)$ has diagonal entries zero and
off-diagonal entries $-a_{jk}/a_{jj}$. Hence by Theorem 3.26 we have

$$\|{-D^{-1}(A_L + A_R)}\|_\infty = q_\infty,$$

$$\|{-D^{-1}(A_L + A_R)}\|_1 = q_1,$$

$$\|{-D^{-1}(A_L + A_R)}\|_2 \le q_2.$$

Now the assertion follows from Theorem 3.48. □


Note that the sufficient convergence conditions (4.1)–(4.3) are not
equivalent. Roughly speaking, each criterion ensures convergence if the diagonal
entries of $A$ are dominant. The condition (4.1) can also be written as

$$\sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}| < |a_{jj}|, \quad j = 1, \ldots, n; \tag{4.4}$$

i.e., the matrix $A$ is required to be strictly row-diagonally dominant. From
(4.2) it can be deduced (see Problem 4.4) that if

$$\sum_{\substack{j=1 \\ j \ne k}}^{n} |a_{jk}| < |a_{kk}|, \quad k = 1, \ldots, n, \tag{4.5}$$

i.e., if the matrix $A$ is strictly column-diagonally dominant, then the Jacobi
iterations converge.
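
The Jacobi scheme and the estimates of Theorem 4.2 can be tried out on a small system. The following is only a sketch in plain Python: the function names `jacobi` and `q_inf` and the $3 \times 3$ test matrix are assumptions made for illustration and do not come from the text. It evaluates the contraction number $q_\infty$ of (4.1) and compares the actual error with the a posteriori bound in the maximum norm.

```python
def jacobi(A, y, x0, steps):
    """Run `steps` Jacobi sweeps for A x = y, starting from x0."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        # Each new component is built from the previous iterate only.
        x = [(y[j] - sum(A[j][k] * x[k] for k in range(n) if k != j)) / A[j][j]
             for j in range(n)]
    return x


def q_inf(A):
    """Contraction number q_infinity of condition (4.1)."""
    n = len(A)
    return max(sum(abs(A[j][k]) for k in range(n) if k != j) / abs(A[j][j])
               for j in range(n))


# Hypothetical strictly row-diagonally dominant test matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
x_exact = [1.0, 2.0, 3.0]
y = [sum(A[j][k] * x_exact[k] for k in range(3)) for j in range(3)]

q = q_inf(A)
x_prev = jacobi(A, y, [0.0, 0.0, 0.0], 19)
x_new = jacobi(A, y, x_prev, 1)   # iterate number 20
error = max(abs(a - b) for a, b in zip(x_new, x_exact))
# A posteriori bound of Theorem 4.2 in the maximum norm.
bound = q / (1.0 - q) * max(abs(a - b) for a, b in zip(x_new, x_prev))
```

The list comprehension builds the whole new vector before `x` is rebound, which is what makes the displacements simultaneous.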


For the Gauss-Seidel method, which is also known as the method of
successive displacements, we proceed differently and transform $Ax = y$ via

$$(D + A_L)\,x = -A_R\,x + y$$

into the equivalent form

$$x = -(D + A_L)^{-1} A_R\,x + (D + A_L)^{-1} y,$$

which is then solved by the successive approximations

$$x_{\nu+1} := -(D + A_L)^{-1} A_R\,x_\nu + (D + A_L)^{-1} y, \quad \nu = 0, 1, 2, \ldots,$$

with arbitrarily chosen starting element $x_0$. For the actual computations
we rewrite this as

$$(D + A_L)\,x_{\nu+1} = -A_R\,x_\nu + y, \quad \nu = 0, 1, 2, \ldots,$$

and solve the linear system for $x_{\nu+1}$ with the lower triangular matrix $D + A_L$
by forward substitution. This leads to the Gauss-Seidel iteration scheme
in the following explicit form:

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n.$$


Here and in the sequel empty sums have to be interpreted as zero. 

In the Jacobi iteration scheme all the components of the new
approximation vector $x_{\nu+1}$ are obtained by using only the components of the
previous approximation vector $x_\nu$, which explains why this method is also
called the method of simultaneous displacements. However, in the
Gauss-Seidel iterations each new component of $x_{\nu+1}$ is immediately used in the
computation of the next component; i.e., for computing the $j$th component
$x_{\nu+1,j}$, the values $x_{\nu+1,1}, x_{\nu+1,2}, \ldots, x_{\nu+1,j-1}$ are already used. This
is very convenient for computer calculations, since the new values can be
stored in the locations held by the old values, which reduces the storage
requirements.
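
The storage remark can be made concrete with a short sketch. The function name `gauss_seidel` and the test system below are illustrative assumptions, not part of the text. A single list is overwritten in place, so that components with index below $j$ already hold the new values when component $j$ is updated.

```python
def gauss_seidel(A, y, x0, steps):
    """Run `steps` Gauss-Seidel sweeps for A x = y, overwriting one vector."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            # x[k] with k < j is already new, x[k] with k > j is still old.
            s = sum(A[j][k] * x[k] for k in range(n) if k != j)
            x[j] = (y[j] - s) / A[j][j]
    return x


# Hypothetical strictly row-diagonally dominant test matrix.
A = [[4.0, 1.0, 0.0],
     [1.0, 5.0, 2.0],
     [0.0, 2.0, 6.0]]
x_exact = [1.0, 2.0, 3.0]
y = [sum(A[j][k] * x_exact[k] for k in range(3)) for j in range(3)]

x = gauss_seidel(A, y, [0.0, 0.0, 0.0], 25)
error = max(abs(a - b) for a, b in zip(x, x_exact))
```

In contrast to a Jacobi sweep, no second vector is needed to hold the previous iterate.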


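Theorem 4.3 Assume that the matrix $A = (a_{jk})$ fulfills the Sassenfeld
criterion

$$p := \max_{j=1,\ldots,n} p_j < 1,$$

where the numbers $p_j$ are recursively defined by

$$p_1 := \sum_{k=2}^{n} \left| \frac{a_{1k}}{a_{11}} \right|, \qquad p_j := \sum_{k=1}^{j-1} \left| \frac{a_{jk}}{a_{jj}} \right| p_k + \sum_{k=j+1}^{n} \left| \frac{a_{jk}}{a_{jj}} \right|, \quad j = 2, \ldots, n.$$

Then the Gauss-Seidel method, or method of successive displacements,

$$x_{\nu+1,j} = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_{\nu+1,k} - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, x_{\nu,k} + \frac{y_j}{a_{jj}}, \quad j = 1, \ldots, n, \quad \nu = 0, 1, 2, \ldots,$$

converges for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$). We have the a priori error estimate

$$\|x_\nu - x\|_\infty \le \frac{p^\nu}{1 - p}\, \|x_1 - x_0\|_\infty$$

and the a posteriori error estimate

$$\|x_\nu - x\|_\infty \le \frac{p}{1 - p}\, \|x_\nu - x_{\nu-1}\|_\infty$$

for all $\nu \in \mathbb{N}$.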
Proof. Consider the equation

$$(D + A_L)\,x = -A_R\,z$$

for $z \in \mathbb{C}^n$ with $\|z\|_\infty = 1$, that is,

$$x_j = -\sum_{k=1}^{j-1} \frac{a_{jk}}{a_{jj}}\, x_k - \sum_{k=j+1}^{n} \frac{a_{jk}}{a_{jj}}\, z_k, \quad j = 1, \ldots, n.$$

By induction, this implies that $|x_j| \le p_j$ for $j = 1, \ldots, n$, and therefore
$\|x\|_\infty \le p$. Hence we have

$$\|(D + A_L)^{-1} A_R\|_\infty \le p,$$

and the assertion of the theorem follows from Theorem 3.48. □


Corollary 4.4 Assume that the matrix $A$ is strictly row-diagonally
dominant. Then the Gauss-Seidel iterations converge.


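Example 4.5 The tridiagonal matrix

$$A = \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}$$

from Example 2.1 is not strictly row-diagonally dominant, but it satisfies
the Sassenfeld criterion.

Proof. Obviously, $q_\infty = 1$; i.e., (4.1) is not fulfilled. We have the recursion

$$p_1 = \frac{1}{2}, \qquad p_j = \frac{1}{2}\, p_{j-1} + \frac{1}{2}, \quad j = 2, \ldots, n-1, \qquad p_n = \frac{1}{2}\, p_{n-1}.$$

From this, by induction, it follows that

$$p_j = 1 - \frac{1}{2^j}, \quad j = 1, \ldots, n-1, \qquad p_n = \frac{1}{2} - \frac{1}{2^n}.$$

Therefore,

$$p = 1 - \frac{1}{2^{n-1}} < 1,$$

and this implies convergence of the Gauss-Seidel iterations by Theorem
4.3. □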
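
The Sassenfeld numbers of Theorem 4.3 are easy to evaluate numerically. The sketch below uses the illustrative helper names `sassenfeld` and `model_matrix` (assumptions, not notation from the text) to compute the $p_j$ for the tridiagonal matrix of Example 4.5 with $n = 6$, so that the closed form $p = 1 - 2^{-(n-1)}$ can be checked.

```python
def sassenfeld(A):
    """Return the Sassenfeld numbers p_1, ..., p_n of the matrix A."""
    n = len(A)
    p = []
    for j in range(n):
        # Entries left of the diagonal are weighted by the p_k found so far.
        lower = sum(abs(A[j][k]) * p[k] for k in range(j))
        upper = sum(abs(A[j][k]) for k in range(j + 1, n))
        p.append((lower + upper) / abs(A[j][j]))
    return p


def model_matrix(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 next to it."""
    return [[2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0)
             for k in range(n)] for j in range(n)]


n = 6
p = sassenfeld(model_matrix(n))
p_max = max(p)   # contraction number p of the Sassenfeld criterion
```

Since $p$ approaches one as $n$ grows, the error bound of Theorem 4.3 degrades quickly for large systems of this type.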
Since the matrix $A$ is tridiagonal, the system $Ax = y$ can be solved
efficiently by elimination (see Problem 2.9). Nevertheless, this matrix pro- 
vides a very suitable example for the analysis of iterative methods for lin- 
ear systems arising in the discretization of ordinary and partial differential 
equations. This is due to the fact that in more general cases, for exam- 
ple for the linear system of Example 2.2, there are more technical details 
to consider, which distract from the basic principles. However, these basic 
principles do not depend on the dimension of the underlying differential 
equation problem. 

In Example 4.5, if n is large, the contraction number p will be close to 
one, i.e., the convergence rate of the Gauss-Seidel iterations will be unsat- 
isfactorily slow. Before we indicate how the convergence can be accelerated, 
we continue by discussing a weaker form of row-diagonal dominance. 


Definition 4.6 An $n \times n$ matrix $A = (a_{jk})$ is called reducible if there exist
two nonempty sets $N, M \subset \{1, \ldots, n\}$ such that

$$N \cap M = \emptyset, \qquad N \cup M = \{1, \ldots, n\},$$

and

$$a_{jk} = 0, \quad j \in N, \quad k \in M.$$

Otherwise the matrix is called irreducible.


A reducible matrix $A$, after a reordering of the rows and columns, can
be partitioned into a $2 \times 2$ block matrix of the form

$$A = \begin{pmatrix} A_{11} & 0 \\ A_{21} & A_{22} \end{pmatrix}$$

(see Problem 4.5). Therefore, solving a linear system with the matrix $A$
can be reduced to solving two smaller linear systems with the matrices $A_{11}$
and $A_{22}$.


Theorem 4.7 Assume that the matrix $A = (a_{jk})$ is irreducible and weakly
row-diagonally dominant; i.e., $A$ is row-diagonally dominant,

$$\sum_{\substack{k=1 \\ k \ne j}}^{n} |a_{jk}| \le |a_{jj}|, \quad j = 1, \ldots, n, \tag{4.6}$$

with strict inequality holding for at least one row $j$. Then the Jacobi iterations
converge for each $y \in \mathbb{C}^n$ and each $x_0 \in \mathbb{C}^n$ to the unique solution of
$Ax = y$ (in any norm on $\mathbb{C}^n$).


Proof. By (4.6) and Theorem 3.26 we have that $\|B\|_\infty \le 1$ for the Jacobi
matrix $B = -D^{-1}(A_L + A_R)$. Therefore, from Theorem 3.32 it follows that
$\rho(B) \le 1$ for the spectral radius.

Now assume that there exists an eigenvalue $\lambda$ of $B$ with $|\lambda| = 1$. For the
associated eigenvector $x$ we may assume that $\|x\|_\infty = 1$. Then from $\lambda x = Bx$
we obtain the inequality

$$|\lambda|\,|x_j| \le \sum_{\substack{k=1 \\ k \ne j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| |x_k| \le \sum_{\substack{k=1 \\ k \ne j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| \le 1, \quad j = 1, \ldots, n. \tag{4.7}$$

Let $N := \{j : |x_j| = 1\}$. Since $\|x\|_\infty = 1$, we have that $N \ne \emptyset$. For $j \in N$
we have $|\lambda|\,|x_j| = 1$, and therefore equality holds in (4.7); i.e.,

$$\sum_{\substack{k=1 \\ k \ne j}}^{n} \left| \frac{a_{jk}}{a_{jj}} \right| = 1, \quad j \in N.$$

From this it follows that

$$M := \{1, \ldots, n\} \setminus N \ne \emptyset,$$

since $A$ is weakly row-diagonally dominant. Because $A$ is irreducible, there
exist $j_0 \in N$ and $k_0 \in M$ such that $a_{j_0 k_0} \ne 0$. Now by using

$$|a_{j_0 k_0}|\,|x_{k_0}| < |a_{j_0 k_0}|$$

we obtain the contradiction

$$1 = |x_{j_0}| = |\lambda|\,|x_{j_0}| \le \sum_{\substack{k=1 \\ k \ne j_0}}^{n} \left| \frac{a_{j_0 k}}{a_{j_0 j_0}} \right| |x_k| < \sum_{\substack{k=1 \\ k \ne j_0}}^{n} \left| \frac{a_{j_0 k}}{a_{j_0 j_0}} \right| \le 1.$$

Therefore, we have $\rho(B) < 1$, and the statement of the theorem follows
from Theorem 4.1. □


We leave it to the reader as an exercise to show that the matrix A 
from Example 4.5 is irreducible and weakly row-diagonally dominant (see 
Problem 4.6), implying convergence of the Jacobi iterations. 


4.2 Relaxation Methods 


From combining the a priori error estimate of Theorem 3.48 with Theorem
4.1 we see that the spectral radius $\rho(B)$ of the iteration matrix $B$ may
be considered as a measure for the speed of convergence of the successive
approximations. Therefore, it is desirable to design the iterative scheme
such that $\rho(B)$ becomes small. This aim is the motivation of the relaxation
methods to be discussed in this section.
Each step of the Jacobi iterations can be written in the form

$$x_{\nu+1} = x_\nu + D^{-1}(y - A x_\nu),$$

indicating how the new approximation $x_{\nu+1}$ is obtained by correcting the
previous approximation $x_\nu$. The basic idea of the relaxation methods is
to multiply the correction term by some weight factor. Note that if the
following relaxation iterations converge, then they converge to a solution
of $Ax = y$.


Definition 4.8 The iterative scheme

$$x_{\nu+1} := x_\nu + \omega D^{-1}(y - A x_\nu), \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left( y_j - \sum_{k=1}^{n} a_{jk}\, x_{\nu,k} \right), \quad j = 1, \ldots, n,$$

is known as the Jacobi method with relaxation. The weight factor $\omega > 0$ is
called the relaxation parameter.


Theorem 4.9 Assume that the Jacobi matrix $B := -D^{-1}(A_L + A_R)$ has
real eigenvalues and spectral radius less than one. Then the spectral radius
of the iteration matrix

$$I - \omega D^{-1} A = (1 - \omega) I - \omega D^{-1}(A_L + A_R)$$

for the Jacobi method with relaxation becomes minimal for the relaxation
parameter

$$\omega_{\rm opt} = \frac{2}{2 - \lambda_{\max} - \lambda_{\min}}$$

and has spectral radius

$$\rho(I - \omega_{\rm opt} D^{-1} A) = \frac{\lambda_{\max} - \lambda_{\min}}{2 - \lambda_{\max} - \lambda_{\min}},$$

where $\lambda_{\min}$ and $\lambda_{\max}$ denote the smallest and the largest eigenvalue of $B$,
respectively. In the case $\lambda_{\min} \ne -\lambda_{\max}$ the convergence of the Jacobi method
with optimal relaxation parameter is faster than the convergence of the
Jacobi method without relaxation.


Proof. For $\omega > 0$ the equation $Bu = \lambda u$ is equivalent to

$$[(1 - \omega) I + \omega B]\, u = [1 - \omega + \omega\lambda]\, u.$$

Hence the eigenvalues $\lambda$ of $B$ correspond to the eigenvalues $1 - \omega + \omega\lambda$ of
$(1 - \omega) I + \omega B$. Therefore, the eigenvalues of $(1 - \omega) I + \omega B$ are real, and
the smallest eigenvalue of $(1 - \omega) I + \omega B$ is given by $1 - \omega + \omega\lambda_{\min}$ and the
largest by $1 - \omega + \omega\lambda_{\max}$. Obviously, the spectral radius becomes minimal
if the smallest and the largest eigenvalue are of opposite sign and have the
same absolute value, i.e., if

$$1 - \omega_{\rm opt} + \omega_{\rm opt} \lambda_{\min} = -1 + \omega_{\rm opt} - \omega_{\rm opt} \lambda_{\max}.$$

From this, elementary algebra yields the optimal parameter $\omega_{\rm opt}$ and the
spectral radius $\rho(I - \omega_{\rm opt} D^{-1} A)$ as stated in the theorem. □
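
The balancing argument of the proof can be checked numerically. In the sketch below the function names and the sample values $\lambda_{\min} = -0.2$, $\lambda_{\max} = 0.8$ are hypothetical choices made only for illustration. For real eigenvalues, the spectral radius of the relaxed iteration matrix is the larger of the two extreme moduli, and a scan over $\omega$ confirms that the formula of Theorem 4.9 gives the minimum.

```python
def relaxed_jacobi_radius(omega, lam_min, lam_max):
    """Spectral radius of (1 - omega) I + omega B for real eigenvalues of B."""
    return max(abs(1.0 - omega + omega * lam_min),
               abs(1.0 - omega + omega * lam_max))


def optimal_omega(lam_min, lam_max):
    """Relaxation parameter of Theorem 4.9."""
    return 2.0 / (2.0 - lam_max - lam_min)


# Hypothetical extreme eigenvalues of a Jacobi matrix B.
lam_min, lam_max = -0.2, 0.8

omega_opt = optimal_omega(lam_min, lam_max)
rho_opt = relaxed_jacobi_radius(omega_opt, lam_min, lam_max)
rho_plain = relaxed_jacobi_radius(1.0, lam_min, lam_max)   # no relaxation

# Brute-force scan over omega in (0, 2) as a sanity check.
scan = [relaxed_jacobi_radius(0.01 * i, lam_min, lam_max) for i in range(1, 200)]
```

For a symmetric spectrum, $\lambda_{\min} = -\lambda_{\max}$, the same formula returns $\omega_{\rm opt} = 1$, so relaxation brings no gain in that case.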


For the Gauss-Seidel iterations, from $(D + A_L)\,x_{\nu+1} = -A_R\,x_\nu + y$ it
follows that

$$x_{\nu+1} = x_\nu + D^{-1}[y - A_L x_{\nu+1} - (D + A_R)\,x_\nu].$$

Hence, the corresponding relaxation method is defined as follows. Note
again that if the relaxation iterations converge, then they converge to a
solution of $Ax = y$.


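Definition 4.10 The iterative scheme

$$x_{\nu+1} = x_\nu + \omega D^{-1}[y - A_L x_{\nu+1} - (D + A_R)\,x_\nu], \quad \nu = 0, 1, 2, \ldots,$$

i.e., in components

$$x_{\nu+1,j} = x_{\nu,j} + \frac{\omega}{a_{jj}} \left[ y_j - \sum_{k=1}^{j-1} a_{jk}\, x_{\nu+1,k} - \sum_{k=j}^{n} a_{jk}\, x_{\nu,k} \right], \quad j = 1, \ldots, n,$$

is known as the Gauss-Seidel method with relaxation or as the successive
overrelaxation (SOR) method with relaxation coefficient $\omega > 0$.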


From

$$(D + \omega A_L)\,x_{\nu+1} = \omega y + [(1 - \omega) D - \omega A_R]\,x_\nu$$

we obtain that the iteration matrix of the SOR method is given by

$$B(\omega) := (D + \omega A_L)^{-1} [(1 - \omega) D - \omega A_R].$$

Here, as opposed to the relaxation of the Jacobi method, the iteration
matrix depends nonlinearly on the relaxation parameter. This makes the
convergence analysis of the SOR method more complicated.
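
A component-wise SOR sweep following Definition 4.10 can be sketched as below. The function names, the choice $n = 10$, and the right-hand side are assumptions for illustration. The value used for $\omega$ is the one known to be optimal for this particular tridiagonal model matrix and is simply taken as given here. With $\omega = 1$ the sweep reduces to a Gauss-Seidel sweep, which allows a direct comparison.

```python
import math


def sor(A, y, x0, omega, steps):
    """Run `steps` SOR sweeps for A x = y with relaxation parameter omega."""
    n = len(A)
    x = list(x0)
    for _ in range(steps):
        for j in range(n):
            # x[k], k < j, holds new values; x[k], k >= j, holds old values.
            residual = y[j] - sum(A[j][k] * x[k] for k in range(n))
            x[j] += omega * residual / A[j][j]
    return x


def model_matrix(n):
    """Tridiagonal matrix with 2 on the diagonal and -1 next to it."""
    return [[2.0 if j == k else (-1.0 if abs(j - k) == 1 else 0.0)
             for k in range(n)] for j in range(n)]


n = 10
A = model_matrix(n)
x_exact = [1.0] * n
y = [sum(A[j][k] * x_exact[k] for k in range(n)) for j in range(n)]
x0 = [0.0] * n

omega = 2.0 / (1.0 + math.sin(math.pi / (n + 1)))   # assumed optimal value
x_gs = sor(A, y, x0, 1.0, 30)       # omega = 1 is the Gauss-Seidel method
x_sor = sor(A, y, x0, omega, 30)
err_gs = max(abs(a - b) for a, b in zip(x_gs, x_exact))
err_sor = max(abs(a - b) for a, b in zip(x_sor, x_exact))
```

After the same number of sweeps the overrelaxed iterate is several orders of magnitude closer to the solution than the Gauss-Seidel iterate.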


Theorem 4.11 (Kahan) A necessary condition for the SOR method to
be convergent is that $0 < \omega < 2$.

Proof. Since the eigenvalues $\mu_1, \ldots, \mu_n$ of $B(\omega)$ are the zeros of the
characteristic polynomial, they satisfy

$$\prod_{j=1}^{n} \mu_j = \det B(\omega)$$

(where multiple eigenvalues are repeated according to their algebraic
multiplicity). From this, by the multiplication rules for determinants and since
$D + \omega A_L$ and $(1 - \omega) D - \omega A_R$ are triangular matrices, it follows that

$$\prod_{j=1}^{n} \mu_j = \det (D + \omega A_L)^{-1} \det[(1 - \omega) D - \omega A_R] = (1 - \omega)^n.$$

This now implies

$$\rho[B(\omega)] \ge |1 - \omega|,$$

and from Theorem 4.1 we conclude the necessity of $0 < \omega < 2$ for
convergence. □


Theorem 4.12 (Ostrowski) If $A$ is Hermitian and positive definite, then
the SOR method converges for all $x_0 \in \mathbb{C}^n$, all $y \in \mathbb{C}^n$, and all $0 < \omega < 2$
to the unique solution of $Ax = y$.

Proof. Let $\mu$ be an eigenvalue of $B(\omega)$ with eigenvector $x$; i.e.,

$$[(1 - \omega) D - \omega A_R]\,x = \mu (D + \omega A_L)\,x.$$

With the aid of

$$(2 - \omega) D - \omega A - \omega (A_R - A_L) = 2[(1 - \omega) D - \omega A_R]$$

and

$$(2 - \omega) D + \omega A - \omega (A_R - A_L) = 2[D + \omega A_L]$$

we deduce that

$$[(2 - \omega) D - \omega A - \omega (A_R - A_L)]\,x = \mu [(2 - \omega) D + \omega A - \omega (A_R - A_L)]\,x.$$

Taking the Euclidean scalar product with $x$, it now follows that

$$\mu = \frac{(2 - \omega) d - \omega a + i \omega s}{(2 - \omega) d + \omega a + i \omega s},$$

where we have set

$$a := (Ax, x), \qquad d := (Dx, x), \qquad s := i (A_R x - A_L x, x).$$

Since $A$ is positive definite, we have $a > 0$ and $d > 0$, and since $A$ is
Hermitian, $s$ is real. From

$$|(2 - \omega) d - \omega a| < |(2 - \omega) d + \omega a|$$

for $0 < \omega < 2$ we now can conclude that $|\mu| < 1$ for $0 < \omega < 2$. Hence
convergence of the SOR method for $0 < \omega < 2$ follows from Theorem 4.1. □
The calculation of the optimal relaxation parameter, i.e., the parameter
minimizing the spectral radius, is difficult except in some simple cases.
Usually it is obtained only approximately by trial and error, based on trying
several values of $\omega$ and observing the effect on the speed of convergence.
However, the effort is well worth the time, since the resulting improvement
of the convergence can be considerable, as we will indicate by the
following analysis, which relates the convergence of the SOR method to
that of the Jacobi method for a certain class of matrices that occurs in the
discretization of boundary value problems.


Definition 4.13 A matrix $A = D + A_L + A_R$ with nonsingular diagonal
$D$ is called consistently ordered if the eigenvalues of

$$C(\alpha) := -\alpha D^{-1} A_L - \frac{1}{\alpha}\, D^{-1} A_R, \quad \alpha \in \mathbb{C} \setminus \{0\},$$

do not depend on $\alpha$.


The following remark ensures that the analysis we are going to develop
applies to the matrix of Example 2.1, i.e., of Example 4.5.


Remark 4.14 Tridiagonal matrices with nonzero diagonal elements are
consistently ordered.

Proof. After introducing the diagonal matrix

$$S(\alpha) := \operatorname{diag}(1, \alpha, \alpha^2, \ldots, \alpha^{n-1})$$

for tridiagonal matrices $A = D + A_L + A_R$, we have that

$$S(\alpha)\, C(1)\, S(\alpha)^{-1} = C(\alpha);$$

i.e., all matrices $C(\alpha)$ are similar, and therefore they have the same
eigenvalues. □


Without going into detail, we wish to say that a much wider class of 
matrices arising in the discretization of differential equations enjoys the 
property of being consistently ordered in the sense of Definition 4.13. For 
a more comprehensive study we refer to [61, 63, 66]. 


Theorem 4.15 (Young) Assume that $A$ is a consistently ordered matrix
and that the eigenvalues of the Jacobi matrix $-D^{-1}(A_L + A_R)$ are real
with spectral radius $\Lambda = \rho[-D^{-1}(A_L + A_R)] < 1$. Then the SOR method
converges for all $0 < \omega < 2$. The spectral radius of the SOR matrix $B(\omega)$
is minimal for

$$\omega_{\rm opt} = \frac{2}{1 + \sqrt{1 - \Lambda^2}}.$$

In this case we have

$$\rho[B(\omega_{\rm opt})] = \frac{1 - \sqrt{1 - \Lambda^2}}{1 + \sqrt{1 - \Lambda^2}}.$$
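Proof. From

$$(I + \omega D^{-1} A_L)[\mu I - B(\omega)] = \mu (I + \omega D^{-1} A_L) - D^{-1}[(1 - \omega) D - \omega A_R]$$

$$= (\mu + \omega - 1) I + \sqrt{\mu}\, \omega \left( \sqrt{\mu}\, D^{-1} A_L + \frac{1}{\sqrt{\mu}}\, D^{-1} A_R \right)$$

and the fact that $I + \omega D^{-1} A_L$ is nonsingular it can be seen that $\mu \ne 0$ is
an eigenvalue of $B(\omega)$ if and only if

$$\lambda = \frac{\mu + \omega - 1}{\sqrt{\mu}\, \omega}$$

is an eigenvalue of

$$-\sqrt{\mu}\, D^{-1} A_L - \frac{1}{\sqrt{\mu}}\, D^{-1} A_R.$$

Since $A$ is assumed to be consistently ordered, it follows that $\mu \ne 0$ is an
eigenvalue of $B(\omega)$ if and only if $\lambda$ is an eigenvalue of $-D^{-1}(A_L + A_R)$.
Solving the quadratic equation

$$\mu + \omega - 1 = \sqrt{\mu}\, \omega \lambda$$

yields

$$\mu = \left( \frac{\omega\lambda}{2} \pm \sqrt{\frac{\omega^2 \lambda^2}{4} + 1 - \omega} \right)^2. \tag{4.8}$$

Setting $\alpha = -1$ in Definition 4.13, it is obvious that if $\lambda$ is an eigenvalue
of $-D^{-1}(A_L + A_R)$, then $-\lambda$ also is an eigenvalue of $-D^{-1}(A_L + A_R)$.
Therefore, since we are interested only in the spectral radius of $B(\omega)$, we
can confine our considerations to

$$\mu = \left( \frac{\omega |\lambda|}{2} + \sqrt{\frac{\omega^2 \lambda^2}{4} + 1 - \omega} \right)^2.$$

Because of $|\lambda| < 1$, the quadratic equation

$$\omega^2 \lambda^2 - 4\omega + 4 = 0$$

has two real solutions, and only one of them belongs to the interval $(0, 2)$,
namely

$$\omega_0(\lambda) = \frac{2}{1 + \sqrt{1 - \lambda^2}} \ge 1.$$

This implies that

$$\omega^2 \lambda^2 - 4\omega + 4 > 0, \quad 0 < \omega < \omega_0(\lambda).$$

Therefore, we have

$$|\mu(\omega)| = \left( \frac{\omega |\lambda|}{2} + \sqrt{\frac{\omega^2 \lambda^2}{4} + 1 - \omega} \right)^2, \quad 0 < \omega \le \omega_0(\lambda). \tag{4.9}$$

For $\omega_0(\lambda) < \omega < 2$ the eigenvalues are complex, with

$$|\mu(\omega)| = \omega - 1, \quad \omega_0(\lambda) < \omega < 2. \tag{4.10}$$

From the expressions (4.9) and (4.10) it can be seen that $|\mu(\omega)|$ is
monotonically nondecreasing with respect to $|\lambda|$. Hence

$$\rho[B(\omega)] = \begin{cases}
\left( \dfrac{\omega\Lambda}{2} + \sqrt{\dfrac{\omega^2 \Lambda^2}{4} + 1 - \omega} \right)^2, & 0 < \omega \le \omega_0(\Lambda), \\[2ex]
\omega - 1, & \omega_0(\Lambda) \le \omega < 2.
\end{cases} \tag{4.11}$$

The function

$$f(\omega) := \frac{\omega\Lambda}{2} + \sqrt{\frac{\omega^2 \Lambda^2}{4} + 1 - \omega}$$

has the properties $f(0) = 1$ and

$$f'(\omega) = \frac{\Lambda}{2} + \frac{\omega \Lambda^2 - 2}{2 \sqrt{\omega^2 \Lambda^2 + 4 - 4\omega}} < 0.$$

The latter follows from

$$\Lambda^2 (4 - 4\omega + \omega^2 \Lambda^2) < 4 - 4\omega\Lambda^2 + \omega^2 \Lambda^4 = (2 - \omega\Lambda^2)^2.$$

Therefore, the spectral radius described by (4.11) is strictly monotonically
decreasing for $0 < \omega \le \omega_0$ and strictly monotonically increasing for
$\omega_0 \le \omega < 2$ (see Figure 4.1). Since $\rho[B(0)] = \rho[B(2)] = 1$, we finally
obtain that $\rho[B(\omega)] < 1$ for all $0 < \omega < 2$ and that $\rho[B(\omega)]$ assumes its
minimum for $\omega = \omega_0(\Lambda)$ with value $\rho[B(\omega_0(\Lambda))] = \omega_0(\Lambda) - 1$. □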


Corollary 4.16 Under the assumptions of Theorem 4.15 the Gauss-Seidel
method converges twice as fast as the Jacobi method.

Proof. From (4.8) we observe that $\mu = \lambda^2$ for $\omega = 1$; i.e., we have

$$\rho[B(1)] = \{\rho[-D^{-1}(A_L + A_R)]\}^2$$

for the spectral radii of the Gauss-Seidel matrix $B(1)$ and the Jacobi
matrix $-D^{-1}(A_L + A_R)$. Now the statement follows from the observation that
by the a priori estimate of Theorem 3.48 the number $N$ of iterations
required for a desired accuracy is inversely proportional to the modulus of
the logarithm of the spectral radius; i.e.,

$$\frac{N(\text{Gauss-Seidel})}{N(\text{Jacobi})} \approx \frac{\ln \rho[-D^{-1}(A_L + A_R)]}{\ln \rho[B(1)]} = \frac{1}{2},$$

and this proves the assertion. □
FIGURE 4.1. Spectral radius for SOR (plot of $\rho[B(\omega)]$ against $\omega$ for $0 \le \omega \le 2$)


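Example 4.17 For the tridiagonal matrix $A$ from Example 4.5 we have

$$\frac{N(\text{SOR})}{N(\text{Jacobi})} \approx \frac{\pi}{4(n+1)}$$

for the optimal relaxation parameter.

Proof. Using the trigonometric addition theorem in the form
$\sin(\alpha + \beta) + \sin(\alpha - \beta) = 2 \sin\alpha \cos\beta$,
it can be seen that the Jacobi matrix

$$-D^{-1}(A_L + A_R) = \frac{1}{2} \begin{pmatrix}
0 & 1 & & & \\
1 & 0 & 1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & 1 & 0 & 1 \\
 & & & 1 & 0
\end{pmatrix}$$

corresponding to Example 4.5 has the eigenvalues

$$\lambda_j = \cos \frac{j\pi}{n+1}, \quad j = 1, \ldots, n,$$

and associated eigenvectors $v_j$ with components

$$v_{j,k} = \sin \frac{jk\pi}{n+1}, \quad k = 1, \ldots, n, \quad j = 1, \ldots, n.$$

Hence,

$$\Lambda = \rho[-D^{-1}(A_L + A_R)] = \cos \frac{\pi}{n+1} \approx 1 - \frac{\pi^2}{2(n+1)^2}$$

and

$$-\ln \rho[-D^{-1}(A_L + A_R)] \approx \frac{\pi^2}{2(n+1)^2}.$$

From Theorem 4.15 we obtain

$$\omega_{\rm opt} = \frac{2}{1 + \sin \dfrac{\pi}{n+1}}$$

and

$$\rho[B(\omega_{\rm opt})] = \frac{1 - \sin \dfrac{\pi}{n+1}}{1 + \sin \dfrac{\pi}{n+1}} \approx 1 - \frac{2\pi}{n+1},$$

whence

$$-\ln \rho[B(\omega_{\rm opt})] \approx \frac{2\pi}{n+1}$$

follows. This concludes the proof. □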

For example, for $n = 30$ the optimal SOR method is about forty times
as fast as the Jacobi method. Note that the gain in the speed
of convergence grows as $n$ increases. The fact that in Example 4.17,
and, more generally, in almost all linear systems arising in the
discretization of boundary value problems, the optimal relaxation parameter has the
property $\omega > 1$ explains why the method is known as the overrelaxation
method.
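
The factor of about forty can be reproduced directly from the formulas of Theorem 4.15 and Example 4.17. The function name `sor_vs_jacobi` below is an illustrative assumption. The sketch returns the optimal parameter, the SOR spectral radius, and the ratio $N(\text{Jacobi})/N(\text{SOR})$ computed from the logarithms of the two spectral radii, without the asymptotic approximations used in the proof.

```python
import math


def sor_vs_jacobi(n):
    """Optimal SOR data for the n x n tridiagonal model matrix of Example 4.5."""
    lam = math.cos(math.pi / (n + 1))        # spectral radius of the Jacobi matrix
    root = math.sqrt(1.0 - lam * lam)        # equals sin(pi / (n + 1))
    omega_opt = 2.0 / (1.0 + root)
    rho_sor = omega_opt - 1.0                # spectral radius of B(omega_opt)
    # Iteration counts are inversely proportional to |ln(spectral radius)|.
    speedup = math.log(rho_sor) / math.log(lam)
    return omega_opt, rho_sor, speedup


omega_opt, rho_sor, speedup = sor_vs_jacobi(30)
asymptotic = 4.0 * (30 + 1) / math.pi        # reciprocal of pi / (4 (n + 1))
```

For $n = 30$ the exact ratio and the asymptotic value $4(n+1)/\pi$ agree to within a fraction of one iteration count.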


4.3 Two-Grid Methods


Consider the linear system

$$Ax = y \tag{4.12}$$

with a nonsingular matrix $A$, and assume that we already have an
approximate solution $x_0$ available with a residual, or defect,

$$r_0 := y - A x_0,$$

for which, in general, $r_0 \ne 0$. Then we try to improve on the accuracy by
writing

$$x_1 = x_0 + \delta_0 \tag{4.13}$$

with some correction term $\delta_0$. Substituting this into (4.12) we obtain that
$\delta_0$ has to satisfy the defect correction equation

$$A \delta_0 = r_0$$

in order that $x_1$ satisfy (4.12). We observe that the correction term $\delta_0$ will,
in general, be small compared to $x_0$, and therefore it is unnecessary to solve
the defect correction equation exactly. Hence we write

$$\delta_0 = A^{-1}_{\rm approx}\, r_0,$$

where $A^{-1}_{\rm approx}$ is some approximation for the inverse $A^{-1}$ of $A$. Substituting
this into (4.13) we obtain

$$x_1 = x_0 + A^{-1}_{\rm approx}[y - A x_0] = (I - A^{-1}_{\rm approx} A)\,x_0 + A^{-1}_{\rm approx}\, y \tag{4.14}$$

as our new approximate solution to (4.12). This procedure is known as the
defect correction principle.

Repeating this process yields the defect correction iteration defined by

$$x_{\nu+1} := x_\nu + A^{-1}_{\rm approx}[y - A x_\nu], \quad \nu = 0, 1, 2, \ldots, \tag{4.15}$$

for the solution of (4.12). By Theorem 4.1, the iteration (4.15) converges
to the unique solution $x$ of $A^{-1}_{\rm approx}[y - Ax] = 0$, provided that the spectral
radius of the iteration matrix $I - A^{-1}_{\rm approx} A$ is less than one. Since the unique
solution $x$ of $Ax = y$ trivially satisfies $A^{-1}_{\rm approx}[y - Ax] = 0$, we then have
convergence of the scheme (4.15) to the unique solution of (4.12). For a
rapid convergence it is desirable that the spectral radius be close to zero,
which will be the case if $A^{-1}_{\rm approx}$ is a reasonable approximation to $A^{-1}$. For
a more complete introduction to the defect correction principle we refer to
[56].

Here we wish to indicate briefly two applications. Firstly, the defect
correction principle (4.14) can be used to improve on the accuracy of an
approximate solution $x_0$, obtained for example by Gaussian elimination.
Then, in principle, the computation of $x_0$ corresponds to some
approximation $x_0 = A^{-1}_{\rm approx}\, y$ obtained from an LR decomposition. This means that
evaluating $\delta_0 = A^{-1}_{\rm approx}\, r_0$ is achieved by applying again the same
elimination algorithm to the defect correction equation. This way, the defect
correction principle provides a simple tool to improve on the accuracy of a
solution to a linear system obtained by elimination.
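
The defect correction iteration (4.15) can be sketched in a few lines. Everything specific below is an assumption for illustration: the names `matvec` and `defect_correction`, the $2 \times 2$ matrix, and the approximate inverse, which is the exact inverse rounded to one decimal place to mimic an inexact elimination. For this choice the iteration matrix $I - A^{-1}_{\rm approx} A$ has spectral radius $0.1$, so the error eventually shrinks by about one digit per step.

```python
def matvec(M, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def defect_correction(A, A_inv_approx, y, x0, steps):
    """Iterate x <- x + A_inv_approx (y - A x) as in (4.15)."""
    x = list(x0)
    for _ in range(steps):
        r = [yi - ai for yi, ai in zip(y, matvec(A, x))]   # defect y - A x
        delta = matvec(A_inv_approx, r)                    # approximate correction
        x = [xi + di for xi, di in zip(x, delta)]
    return x


# Hypothetical example: the exact inverse of A is (1/22) [[4, -2], [-3, 7]];
# the approximate inverse is that matrix rounded to one decimal place.
A = [[7.0, 2.0],
     [3.0, 4.0]]
A_inv_approx = [[0.2, -0.1],
                [-0.1, 0.3]]
x_exact = [1.0, -1.0]
y = matvec(A, x_exact)

x = defect_correction(A, A_inv_approx, y, [0.0, 0.0], 12)
error = max(abs(a - b) for a, b in zip(x, x_exact))
```

The quality of the approximate inverse only affects the speed of convergence, while the limit is still the exact solution of $Ax = y$.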

Secondly, we would like to illustrate the more systematic use of the defect 
correction principle for the development of multigrid methods as a powerful 
tool for the fast iterative solution of linear systems arising in the discretiza- 
tion of differential and integral equations. For the sake of simplicity we will 
confine ourselves to the case of two-grid iterations. 

The basic idea of two-grid methods is to use the defect correction
principle with the approximate inverse $A^{-1}_{\rm approx}$ for the matrix $A_{\rm fine}$ of a large
linear system corresponding to a fine approximation grid given simply by the
exact inverse of the matrix $A_{\rm coarse}$ of a smaller linear system,
corresponding to a coarse approximation grid. Of course, a number of mathematical
problems arise in the design of such methods concerning the appropriate
relation between the fine and coarse grid and the transfer between the two
grids. We will outline some ideas on the structure of two-grid methods by
again considering the simple model problem from Example 2.1 as a typical
case.

Recall that the solution vector $U^{(h)} \in \mathbb{R}^n$ of the linear system

$$A^{(h)} U^{(h)} = F^{(h)} \tag{4.16}$$

with the $n \times n$ tridiagonal matrix

$$A^{(h)} = \frac{1}{h^2} \begin{pmatrix}
2 & -1 & & & \\
-1 & 2 & -1 & & \\
 & \ddots & \ddots & \ddots & \\
 & & -1 & 2 & -1 \\
 & & & -1 & 2
\end{pmatrix}$$

corresponds to approximate values $U_j^{(h)} \approx u(jh)$, $j = 1, \ldots, n$, for the
solution $u$ of the boundary value problem (2.1)–(2.2) at the internal grid
points. Since we want to make use of two different grids in our analysis, we
indicate the dependence on the mesh width

$$h = \frac{1}{n+1}$$

in the matrix $A^{(h)}$ and the solution $U^{(h)}$. We assume that $n$ is odd because
later we want to choose the coarser grid by doubling the mesh width.

We start from the Jacobi iteration with relaxation

$$U_{\nu+1} = U_\nu - \omega [D^{(h)}]^{-1} [A^{(h)} U_\nu - F^{(h)}], \quad \nu = 0, 1, 2, \ldots, \tag{4.17}$$

as introduced in Definition 4.8. From our analysis in Example 4.17 we
deduce that $A^{(h)}$ has the $n$ eigenvalues

$$\mu_j = \frac{4}{h^2} \sin^2 \frac{j\pi h}{2}, \quad j = 1, \ldots, n, \tag{4.18}$$

and associated eigenvectors $v_j^{(h)}$ with components

$$v_{j,k}^{(h)} = \sin(\pi j k h), \quad k = 1, \ldots, n, \quad j = 1, \ldots, n. \tag{4.19}$$

Note that by Theorem 3.29, the eigenvectors of the Hermitian matrix $A^{(h)}$
form an orthogonal basis for $\mathbb{R}^n$ (see Problem 4.18). The $v_j^{(h)}$, $j = 1, \ldots, n$,
are also eigenvectors of the Jacobi matrix $I - [D^{(h)}]^{-1} A^{(h)}$, with eigenvalues

$$\lambda_j = \cos(\pi j h), \quad j = 1, \ldots, n.$$
From Theorem 4.9 we observe that $\omega = 1$ is the optimal choice for the
Jacobi iteration with relaxation. However, it will turn out that in the
context of two-grid methods the damped, or underrelaxed, Jacobi method with
$0 < \omega < 1$ is more important. This is due to the following observation.

Since the $v_j^{(h)}$, $j = 1, \ldots, n$, provide a basis for $\mathbb{R}^n$, we can represent the
difference between the exact solution $U^{(h)}$ and the $\nu$th iteration $U_\nu$ in the
form

$$U^{(h)} - U_\nu = \sum_{j=1}^{n} \alpha_{j,\nu}\, v_j^{(h)}.$$

From the fact that

$$\left\{ I - \omega [D^{(h)}]^{-1} A^{(h)} \right\} v_j^{(h)} = \left\{ 1 - 2\omega \sin^2 \frac{j\pi h}{2} \right\} v_j^{(h)}, \quad j = 1, \ldots, n,$$

we derive the recurrence relation

$$\alpha_{j,\nu+1} = \left\{ 1 - 2\omega \sin^2 \frac{j\pi h}{2} \right\} \alpha_{j,\nu}, \quad j = 1, \ldots, n,$$

for the coefficients $\alpha_{j,\nu}$. In particular, if we choose $\omega = 0.5$, we have that

$$\alpha_{j,\nu+1} = \cos^2 \frac{j\pi h}{2}\, \alpha_{j,\nu}, \quad j = 1, \ldots, n. \tag{4.20}$$

From this we observe that even though convergence of the iterations (4.17)
becomes slower when we decrease $\omega$, for $\omega = 0.5$ the convergence restricted
to the subspace

$$W_h := \operatorname{span}\left\{ v_{\frac{n+1}{2}}^{(h)}, \ldots, v_n^{(h)} \right\}$$

of high frequencies is dramatically accelerated, since in this case from (4.20)
we have that

$$|\alpha_{j,\nu+1}| \le \frac{1}{2}\, |\alpha_{j,\nu}|, \quad j = \frac{n+1}{2}, \ldots, n.$$

This fact can be expressed by saying that the damped Jacobi iteration is a
smoothing iteration. In the sequel we will consider only the damping factor
$\omega = 0.5$.

The slow convergence with respect to low frequencies will now be taken
care of by the defect correction principle through incorporating a so-called
coarse grid correction on the grid with mesh width $2h$. For this we need
to transfer vectors corresponding to the fine grid to vectors
corresponding to the coarse grid and vice versa. The transfer from the fine grid
to the coarse grid requires a restriction and corresponds to a mapping
$R^{(h)} : \mathbb{R}^n \to \mathbb{R}^{\frac{n-1}{2}}$. Note that we only need to consider this mapping for
the interior grid points. Instead of choosing the restriction $(R^{(h)} y)_k = y_{2k}$,
$k = 1, \ldots, \frac{n-1}{2}$, for $y \in \mathbb{R}^n$ it turns out to be advantageous to also
incorporate information contained in the odd nodal points of the fine grid by
using the restriction

$$(R^{(h)} y)_k = \frac{1}{4} \left[ y_{2k-1} + 2 y_{2k} + y_{2k+1} \right], \quad k = 1, \ldots, \frac{n-1}{2},$$

as illustrated in Figure 4.2.
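
The smoothing property expressed by (4.20) can be made visible by tabulating the damping factors $1 - 2\omega \sin^2(j\pi h/2)$ of the relaxed Jacobi iteration. The function name `damping_factors` and the choice $n = 15$ in the sketch below are assumptions for illustration only.

```python
import math


def damping_factors(n, omega):
    """Factors 1 - 2 omega sin^2(j pi h / 2), j = 1, ..., n, with h = 1/(n+1)."""
    h = 1.0 / (n + 1)
    return [1.0 - 2.0 * omega * math.sin(0.5 * j * math.pi * h) ** 2
            for j in range(1, n + 1)]


n = 15
half = (n + 1) // 2

damped = damping_factors(n, 0.5)          # equals cos^2(j pi h / 2), see (4.20)
high_damped = damped[half - 1:]           # j = (n+1)/2, ..., n
low_damped = damped[:half - 1]            # j = 1, ..., (n-1)/2

undamped = damping_factors(n, 1.0)        # plain Jacobi: cos(j pi h)
worst_high_undamped = max(abs(f) for f in undamped[half - 1:])
```

With $\omega = 0.5$ every high-frequency factor is at most one half, whereas for $\omega = 1$ the highest frequency is damped almost as slowly as the lowest one.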


FIGURE 4.2. Restriction operator of the two-grid method for n = 7


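The corresponding matrix is

$$R^{(h)} = \frac{1}{4} \begin{pmatrix}
1 & 2 & 1 & & & & & \\
 & & 1 & 2 & 1 & & & \\
 & & & & \ddots & \ddots & \ddots & \\
 & & & & & 1 & 2 & 1
\end{pmatrix}.$$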

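With the aid of elementary trigonometric manipulations one can establish
the relation

$$R^{(h)} v_j^{(h)} = c_j^2\, v_j^{(2h)}, \qquad R^{(h)} v_{n+1-j}^{(h)} = -s_j^2\, v_j^{(2h)}, \quad j = 1, \ldots, \frac{n-1}{2}, \tag{4.21}$$

between the eigenvectors (4.19) for the fine and the coarse grid (see Problem
4.19). Here we have set

$$c_j := \cos \frac{j\pi h}{2}, \qquad s_j := \sin \frac{j\pi h}{2}, \quad j = 1, \ldots, \frac{n-1}{2}.$$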

The transfer from the coarse grid to the fine grid is called prolongation
and corresponds to a mapping $P^{(h)} : \mathbb{R}^{\frac{n-1}{2}} \to \mathbb{R}^n$. The simplest choice for
$P^{(h)}$ is given by the piecewise linear interpolation (see Chapter 8)

$$(P^{(h)} y)_{2k} = y_k, \quad k = 1, \ldots, \frac{n-1}{2},$$

$$(P^{(h)} y)_{2k-1} = \frac{1}{2} \left[ y_k + y_{k-1} \right], \quad k = 1, \ldots, \frac{n+1}{2},$$

for $y \in \mathbb{R}^{\frac{n-1}{2}}$, as illustrated in Figure 4.3. The corresponding matrix is
given by $P^{(h)} = 2 [R^{(h)}]^\top$. Either by direct computation or from (4.21) and
the fact that the matrices $P^{(h)}$ and $2R^{(h)}$ are adjoint one can establish that
(see Problem 4.19)

$$P^{(h)} v_j^{(2h)} = c_j^2\, v_j^{(h)} - s_j^2\, v_{n+1-j}^{(h)}, \quad j = 1, \ldots, \frac{n-1}{2}. \tag{4.22}$$


FIGURE 4.3. Prolongation operator of the two-grid method for n = 7 


Now we are in a position to use the $n \times n$ matrix $P^{(h)}[A^{(2h)}]^{-1}R^{(h)}$ as
the coarse-grid correction. Computing $P^{(h)}[A^{(2h)}]^{-1}R^{(h)}y$ corresponds to
first restricting the vector $y \in \mathbb{R}^n$ to $R^{(h)}y \in \mathbb{R}^{\frac{n-1}{2}}$, then solving the
$\frac{n-1}{2} \times \frac{n-1}{2}$ system $A^{(2h)}z = R^{(h)}y$ by an elimination method, and finally
prolonging the solution $z \in \mathbb{R}^{\frac{n-1}{2}}$ to $P^{(h)}z \in \mathbb{R}^n$. Combining this coarse-grid
correction with $N$ steps of the damped Jacobi iteration in the sense of
(4.14) now yields one step of the two-grid iteration scheme

$$U_{\nu+1} = J_N(U_\nu, F^{(h)}) - P^{(h)}[A^{(2h)}]^{-1}R^{(h)}\left[A^{(h)}J_N(U_\nu, F^{(h)}) - F^{(h)}\right],$$

where $J_N(U_\nu, F^{(h)})$ denotes the result of $N$ steps of the damped Jacobi
iterations (4.17) with starting element $U_\nu$. Obviously, the iteration matrix
corresponding to this two-grid method is given by

$$T_N = \left\{I - P^{(h)}[A^{(2h)}]^{-1}R^{(h)}A^{(h)}\right\}\left\{I - \frac{h^2}{4}\,A^{(h)}\right\}^N. \tag{4.23}$$


For an investigation of the convergence for our two-grid iteration scheme
we need to determine the spectral radius of $T_N$. For simplicity we confine
ourselves to the case where $N = 1$; i.e., one step of the damped Jacobi iteration
on the fine grid alternates with a coarse-grid correction by elimination
on the coarse grid. We set $T_1 = T$.


Theorem 4.18 For the spectral radius of $T$ we have that $\rho(T) = 0.5$; i.e.,
the two-grid iterations converge.

Proof. We note that from (4.18) and (4.19), with $h$ replaced by $2h$, we have
that

$$A^{(2h)}v_j^{(2h)} = \frac{1}{h^2}\,\sin^2(j\pi h)\,v_j^{(2h)} = \frac{4}{h^2}\,s_j^2c_j^2\,v_j^{(2h)},$$

whence

$$[A^{(2h)}]^{-1}v_j^{(2h)} = \frac{h^2}{4s_j^2c_j^2}\,v_j^{(2h)}, \quad j = 1,\ldots,\frac{n-1}{2},$$

follows. From this, using (4.20)–(4.22) and $R^{(h)}v_{\frac{n+1}{2}}^{(h)} = 0$, it can be derived
that

$$\begin{pmatrix} Tv_j^{(h)} \\ Tv_{n+1-j}^{(h)} \end{pmatrix} = s_j^2c_j^2 \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix} \begin{pmatrix} v_j^{(h)} \\ v_{n+1-j}^{(h)} \end{pmatrix} \tag{4.24}$$

for $j = 1,\ldots,\frac{n-1}{2}$ and

$$Tv_{\frac{n+1}{2}}^{(h)} = \frac{1}{2}\,v_{\frac{n+1}{2}}^{(h)}. \tag{4.25}$$

Since the matrix

$$\begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix}$$

has the eigenvalues 0 and 2, from (4.24) and (4.25) it can be seen that the
matrix $T$ has the eigenvalues

$$2s_j^2c_j^2 = \frac{1}{2}\,\sin^2 j\pi h, \quad j = 1,\ldots,\frac{n+1}{2},$$

and the eigenvalue zero of multiplicity $\frac{n-1}{2}$. This implies the assertion on
the spectral radius of $T$. □


Theorem 4.18 shows that the two-grid method is a very fast iteration. As 
compared to the classical Jacobi and Gauss-Seidel methods and also to the 
SOR method with optimal relaxation parameter, it decreases the spectral 
radius from a value close to one to one-half, which causes a substantial 
increase in the speed of convergence. However, for practical computations 
it has the disadvantage that in each step the solution of a system with half 
the number of unknowns is required.
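The two-grid step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's program: it assumes $A^{(h)} = h^{-2}\,\mathrm{tridiag}(-1, 2, -1)$ with $h = 1/(n+1)$ and the unnormalized eigenvectors with components $\sin(j\pi kh)$, which is what the formulas in the proof suggest, since (4.18) and (4.19) themselves lie outside this passage. All function names are hypothetical.

```python
import math

def apply_A(u, h):
    """Multiply u by A^(h) = h^-2 * tridiag(-1, 2, -1) (assumed model matrix)."""
    n = len(u)
    return [(2.0 * u[k]
             - (u[k - 1] if k > 0 else 0.0)
             - (u[k + 1] if k < n - 1 else 0.0)) / h**2
            for k in range(n)]

def solve_A(b, h):
    """Solve A^(h) z = b by elimination (Thomas algorithm for tridiagonal systems)."""
    n = len(b)
    diag, off = 2.0 / h**2, -1.0 / h**2
    c, d = [0.0] * n, [0.0] * n
    c[0], d[0] = off / diag, b[0] / diag
    for k in range(1, n):
        denom = diag - off * c[k - 1]
        c[k] = off / denom
        d[k] = (b[k] - off * d[k - 1]) / denom
    z = [0.0] * n
    z[-1] = d[-1]
    for k in range(n - 2, -1, -1):
        z[k] = d[k] - c[k] * z[k + 1]
    return z

def restrict(y):
    """(R y)_k = (y_{2k-1} + 2 y_{2k} + y_{2k+1}) / 4 for k = 1, ..., (n-1)/2."""
    m = (len(y) - 1) // 2
    return [(y[2*k - 2] + 2.0 * y[2*k - 1] + y[2*k]) / 4.0 for k in range(1, m + 1)]

def prolong(z):
    """Piecewise linear interpolation from the coarse grid to the fine grid."""
    m = len(z)
    zz = [0.0] + list(z) + [0.0]                      # z_0 = z_{m+1} = 0
    fine = [0.0] * (2 * m + 1)
    for k in range(1, m + 1):
        fine[2*k - 1] = zz[k]                         # (P z)_{2k} = z_k
    for k in range(1, m + 2):
        fine[2*k - 2] = 0.5 * (zz[k] + zz[k - 1])     # (P z)_{2k-1}
    return fine

def two_grid_step(u, f, h):
    """One damped Jacobi step (omega = 0.5) followed by the coarse-grid correction."""
    Au = apply_A(u, h)
    u = [uk - 0.25 * h**2 * (a - fk) for uk, a, fk in zip(u, Au, f)]
    defect = [a - fk for a, fk in zip(apply_A(u, h), f)]
    z = solve_A(restrict(defect), 2.0 * h)
    return [uk - pk for uk, pk in zip(u, prolong(z))]

# With F = 0 the step applies the iteration matrix T of Theorem 4.18 (N = 1).
n = 7
h = 1.0 / (n + 1)
v_mid = [math.sin(k * math.pi / 2) for k in range(1, n + 1)]   # j = (n+1)/2
print(two_grid_step(v_mid, [0.0] * n, h))
```

With a zero right-hand side, one call applies $T$ to the starting vector, so the printed vector should be one-half of `v_mid` up to rounding, in line with (4.25). Applying the same call to $v_1^{(h)} + v_n^{(h)}$ should reproduce the eigenvalue $\frac{1}{2}\sin^2 \pi h$ that follows from (4.24).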

This drawback of the two-grid method is remedied by the multigrid
method. Whereas for the two-grid method as described above only two
grids are used, the multigrid method uses $M > 2$ different grids with mesh
widths $h_\mu = 2^\mu h$, $\mu = 1,\ldots,M$, obtained from the mesh width $h$ on the
finest grid. The multigrid method is defined recursively. The method for
$M + 1$ grids performs one or several steps of the damped Jacobi iteration
on the finest grid with mesh width $h$ and uses as approximate inverse for
the defect correction one or several steps of the multigrid iteration on the
$M$ grids with mesh widths $2h, 4h, \ldots, 2^Mh$. To be more explicit, the three-grid
method uses one or several steps of the two-grid method as the defect
correction of the damped Jacobi iteration on the finest grid; the four-grid
method uses one or several steps of the three-grid method as the defect
correction; and so on. A description of further details of the multigrid method,
in particular a proof that the computational cost of one step of a multigrid
iteration is proportional to the cost of the Jacobi iterations on the finest
grid provided that the coarsest grid is coarse enough, is beyond the aim of
this introduction. For a comprehensive study we refer to [8, 26, 29, 63].
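The cost statement can be made plausible by a rough count. The sketch below is a simplification and not a proof: it assumes that $n + 1$ is a power of two, that each grid is visited once per multigrid step, and that the work on a grid is proportional to its number of interior points. The function name `grid_sizes` is hypothetical.

```python
def grid_sizes(n):
    """Numbers of interior grid points n, (n-1)/2, ... down to a single point."""
    sizes = [n]
    while sizes[-1] > 1:
        if sizes[-1] % 2 == 0:
            raise ValueError("n + 1 must be a power of two")
        sizes.append((sizes[-1] - 1) // 2)
    return sizes

sizes = grid_sizes(127)
print(sizes, sum(sizes))   # the sum of all grid sizes stays below 2 * 127
```

Under these assumptions the total number of unknowns over all grids is less than twice the number on the finest grid, which is the geometric-series argument behind the proportionality claim.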


Problems 


4.1 Consider the solution of the linear system 
OL 1 — 243 = —1 
—47; + 8x2 + 273 = 18 
O22 + 9x3 = 37 


by the Jacobi method. Give an estimate on the number of iterations needed to 
ensure that ||z, — z||.. < 107° if the iteration is started with zo = (0,0,0)’. 


4.2 Write a computer program for the Jacobi method, the Gauss-Seidel method, 
and the SOR method and test it for various examples. 


4.3 Show that a matrix $A$ has spectral radius $\rho(A) < 1$ if and only if it satisfies
$\lim_{\nu\to\infty} A^\nu = 0$.


4.4 Prove that the Jacobi method converges for strictly column-diagonally dom- 
inant matrices (compare (4.5)). 


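4.5 Show that an $n \times n$ matrix $A$ is reducible if and only if there exists an $n \times n$
permutation matrix $P$ such that

$$P^{-1}AP = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix},$$

where $A_{11}$ is a $k \times k$ matrix and $A_{22}$ is an $(n-k) \times (n-k)$ matrix with $1 \le k \le n-1$.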


4.6 Show that the matrix $A$ from Example 4.5 is irreducible and weakly row-diagonally
dominant.


4.7 Let

$$A = \begin{pmatrix} 1 & a & a \\ a & 1 & a \\ a & a & 1 \end{pmatrix}.$$

Show that for $1 < 2a < 2$ the Gauss-Seidel method is convergent and the Jacobi
method is not.


4.8 For the matrix

$$A = \begin{pmatrix} 1 & -2 & 2 \\ -1 & 1 & -1 \\ -2 & -2 & 1 \end{pmatrix}$$

show that the Jacobi method is convergent and the Gauss-Seidel method is not.


4.9 For the matrix

$$A = \begin{pmatrix} 2 & 1 & -1 \\ -2 & 2 & -2 \\ 1 & 1 & 2 \end{pmatrix}$$

show that the Gauss-Seidel method is convergent and the Jacobi method is not.


4.10 Show that the matrix

$$A = \begin{pmatrix} 2 & 0 & -1 & -1 \\ 0 & 2 & -1 & -1 \\ -1 & -1 & 2 & 0 \\ -1 & -1 & 0 & 2 \end{pmatrix}$$

is irreducible and that the Jacobi method is not convergent.


4.11 Show that the iteration matrix of the Gauss—Seidel method has eigenvalue 
zero. 


4.12 Consider the variant of the Gauss-Seidel iteration where the components 
are iterated from the nth component backward to the first component. What is 
the iteration matrix of this method? Obtain a symmetric method by alternating 
one step of the forward Gauss-Seidel method and one step of the backward 
Gauss-Seidel method. What is the iteration matrix of this method? 


4.13 Show that the Jacobi iteration converges for a matrix $A$ if and only if it
converges for the transposed matrix $A^T$.


4.14 Show that the matrix A of Example 2.2 is irreducible, positive definite, 
and weakly row-diagonally dominant. 


4.15 Compute the eigenvalues of the Jacobi iteration matrix for the matrix A 
of Example 2.2. 


4.16 Let $A = (a_{jk})$ be a nonnegative $n \times n$ matrix, i.e., $a_{jk} \ge 0$, $j,k = 1,\ldots,n$,
and let $\rho(A) < 1$. Show that $I - A$ is nonsingular and $(I - A)^{-1}$ is nonnegative.


4.17 Give a counterexample to show that the Jacobi method, in general, does 
not converge for positive definite matrices (see Theorem 4.12). 


4.18 Show by direct computations that the eigenvectors given by (4.19) are 
orthogonal. 


4.19 Prove the relations (4.21), (4.22), (4.24), and (4.25). 


4.20 Show that

$$\rho(T_N) \le \max_{0 \le t \le 1/2}\left[t(1-t)^N + (1-t)\,t^N\right]$$

for the two-grid iteration matrix with $N$ damped Jacobi iterations at each step.


5
Ill-Conditioned Linear Systems 


For problems in mathematical physics Hadamard [31] postulated three re- 
quirements: A solution should exist, the solution should be unique, and the 
solution should depend continuously on the data. The third postulate is 
motivated by the fact that in general, in applications the data will be mea- 
sured quantities and therefore always contaminated by errors. A problem 
satisfying all three requirements is called well-posed. Otherwise, it is called 
ill-posed. If $A : X \to Y$ is a bounded linear operator mapping a normed
space $X$ into a normed space $Y$, then the equation $Ax = y$ is well-posed
if $A$ is bijective and the inverse operator $A^{-1} : Y \to X$ is bounded (see
Theorem 3.24). Since the inverse of a linear operator again is linear, in 
the case of finite-dimensional spaces X and Y, by Theorem 3.26 bijectivity 
of A implies boundedness of the inverse operator. Hence, in the sense of 
Hadamard, nonsingular linear systems are well-posed. 

However, since one wants to make sure that small errors in the data 
of a linear system will cause only small errors in the solution, there is an 
additional need for a measure of the degree of well-posedness, or stability. 
Such a measure is provided through the notion of the condition number, 
which we will introduce in this chapter. This will enable us to distinguish 
between well-conditioned and ill-conditioned linear systems. For the latter, 
small errors in the data may cause large errors in the solution, and therefore 
their numerical solution requires special care. 

Hence, we will continue the chapter with a brief discussion of the singular 
value cutoff and the Tikhonov regularization as efficient means to deal with 
ill-conditioned linear systems. Our analysis will be based on the singular 
value decomposition and will include the introduction of the pseudo-inverse,
or Moore–Penrose inverse. For an extension of these ideas to ill-posed linear
operator equations in infinite-dimensional spaces we refer to [14, 22, 28, 37, 
39, 43]. 


5.1 Condition Number 


We begin with an example of an ill-conditioned linear system arising through 
a simple least squares problem. 


Example 5.1 We consider the best approximation of a given continuous
function $f : [0,1] \to \mathbb{R}$ by a polynomial

$$p(x) = \sum_{k=0}^{n} \alpha_k x^k$$

of degree $n$ in the least squares sense, i.e., with respect to the $L_2$ norm.
Using the monomials $x \mapsto x^k$, $k = 0,1,\ldots,n$, as a basis of the subspace
$P_n \subset C[0,1]$ of polynomials of degree less than or equal to $n$ (see Theorem
8.2), from Corollary 3.53 and the integrals

$$\int_0^1 x^j x^k\,dx = \frac{1}{j+k+1}$$

it follows that the coefficients $\alpha_0,\ldots,\alpha_n$ of the best approximation are
uniquely determined by the normal equations

$$\sum_{k=0}^{n} \frac{1}{j+k+1}\,\alpha_k = \int_0^1 f(x)\,x^j\,dx, \quad j = 0,\ldots,n. \tag{5.1}$$

In the special case

$$f(x) = \frac{1}{1+x}$$

we have the right-hand sides

$$r_j = \int_0^1 \frac{x^j}{1+x}\,dx, \quad j = 0,\ldots,n.$$

In particular, $r_0 = \ln 2$, and from the geometric sum

$$\sum_{k=1}^{j} (-x)^{k-1} = \frac{1 - (-1)^j x^j}{1+x}, \quad j = 1,\ldots,n,$$

we deduce that

$$r_j = (-1)^j\left\{\ln 2 + \sum_{k=1}^{j} \frac{(-1)^k}{k}\right\}, \quad j = 1,\ldots,n.$$

Therefore, the solution of (5.1) is of the form

$$\alpha_j = \beta_j \ln 2 + \gamma_j, \quad j = 0,\ldots,n,$$

with rational numbers $\beta_j$ and $\gamma_j$. Table 5.1 gives the exact solution of the
linear system (5.1) obtained by Gaussian elimination carried out in terms of
rational numbers to compute the coefficients $\beta_j$ and $\gamma_j$ and then inserting
$\ln 2$ with ten-decimal-digit accuracy. The results indicate convergence of
the coefficients to the coefficients $a_k = (-1)^k$ of the Taylor series for $f$.


TABLE 5.1. Exact solution of the linear system (5.1) 


However, if we take as right-hand sides the values obtained for $r_j$ by using
$\ln 2$ with five-decimal-digit accuracy, then Gaussian elimination yields the
results of Table 5.2.


TABLE 5.2. Numerical solution of the linear system (5.1) 


33.87 
1071.93 | —926.75 | 304.49 


Despite the fact that the changes in the right-hand sides are less than 
0.000005, we obtain drastic changes in the solution. Therefore, qualitatively 
we may say that our linear system provides an example of an ill-conditioned 
system. The matrix of this example is known as the Hilbert matrix. □
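The experiment of Example 5.1 can be repeated with exact rational arithmetic. The following Python sketch is an illustration only: the degree `n = 5` is an assumption (the degree used for the tables is not recoverable from this passage), and the helper names are hypothetical. It solves (5.1) once with $\ln 2$ rounded to ten decimals and once with five decimals, so the only source of error is the rounding of $\ln 2$.

```python
from fractions import Fraction

def hilbert(n):
    """(n+1) x (n+1) matrix of the normal equations (5.1): entries 1/(j+k+1)."""
    return [[Fraction(1, j + k + 1) for k in range(n + 1)] for j in range(n + 1)]

def rhs(n, ln2):
    """r_j = (-1)^j (ln 2 + sum_{k=1}^j (-1)^k / k), with ln 2 replaced by ln2."""
    r, partial = [], Fraction(0)
    for j in range(n + 1):
        if j > 0:
            partial += Fraction((-1) ** j, j)
        r.append((-1) ** j * (ln2 + partial))
    return r

def solve_exact(A, b):
    """Gaussian elimination in exact rational arithmetic (A assumed nonsingular)."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for i in range(n):
        p = next(r for r in range(i, n) if M[r][i] != 0)   # first nonzero pivot
        M[i], M[p] = M[p], M[i]
        for r in range(i + 1, n):
            factor = M[r][i] / M[i][i]
            M[r] = [a - factor * c for a, c in zip(M[r], M[i])]
    x = [Fraction(0)] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][k] * x[k] for k in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

n = 5  # assumed degree, chosen only for this illustration
a10 = solve_exact(hilbert(n), rhs(n, Fraction("0.6931471806")))  # ten decimals
a5 = solve_exact(hilbert(n), rhs(n, Fraction("0.69315")))        # five decimals
for j in range(n + 1):
    print(j, float(a10[j]), float(a5[j]))
```

Because the elimination is exact, any difference between the two printed columns is caused solely by the change of a few units in the sixth decimal of $\ln 2$.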


For a quantitative analysis of the phenomenon illustrated by Example
5.1 we introduce the concept of the condition number.

Definition 5.2 Let $X$ and $Y$ be normed spaces and let $A : X \to Y$ be a
bounded linear operator with a bounded inverse $A^{-1} : Y \to X$. Then

$$\operatorname{cond}(A) := \|A\|\,\|A^{-1}\|$$

is called the condition number of $A$.


Clearly, $\operatorname{cond}(A)$ depends on the chosen norm. Because of (see Remark
3.25)

$$1 = \|I\| = \|AA^{-1}\| \le \|A\|\,\|A^{-1}\|$$

we always have $\operatorname{cond}(A) \ge 1$. Definition 5.2, in particular, includes the
condition number of a nonsingular $n \times n$ matrix $A$. Here, in the case where
both the domain and range are given the $\ell_p$ norm for $p = 1, 2, \infty$ we will
write $\operatorname{cond}_p(A)$.


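Theorem 5.3 Let $X$ and $Y$ be Banach spaces, let $A : X \to Y$ be a bounded
linear operator with a bounded inverse $A^{-1} : Y \to X$, and let $A^\delta : X \to Y$
be a bounded linear operator such that $\|A^{-1}\|\,\|A^\delta - A\| < 1$. Assume that
$x$ and $x^\delta$ are solutions of the equations

$$Ax = y \tag{5.2}$$

and

$$A^\delta x^\delta = y^\delta, \tag{5.3}$$

respectively. Then

$$\frac{\|x^\delta - x\|}{\|x\|} \le \frac{\operatorname{cond}(A)}{1 - \operatorname{cond}(A)\,\dfrac{\|A^\delta - A\|}{\|A\|}} \left\{ \frac{\|y^\delta - y\|}{\|y\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$


Proof. Writing $A^\delta = A[I + A^{-1}(A^\delta - A)]$, by Theorem 3.48 we observe
that the inverse operator $[A^\delta]^{-1} = [I + A^{-1}(A^\delta - A)]^{-1}A^{-1}$ exists and is
bounded by

$$\|[A^\delta]^{-1}\| \le \frac{\|A^{-1}\|}{1 - \|A^{-1}\|\,\|A^\delta - A\|}. \tag{5.4}$$

From (5.2) and (5.3) we find that

$$A^\delta(x^\delta - x) = y^\delta - y - (A^\delta - A)x,$$

whence

$$x^\delta - x = [A^\delta]^{-1}\{y^\delta - y - (A^\delta - A)x\}$$

follows. Now we can estimate

$$\|x^\delta - x\| \le \|[A^\delta]^{-1}\|\,\{\|y^\delta - y\| + \|A^\delta - A\|\,\|x\|\}$$

and insert (5.4) to obtain

$$\frac{\|x^\delta - x\|}{\|x\|} \le \frac{\operatorname{cond}(A)}{1 - \|A^{-1}\|\,\|A^\delta - A\|} \left\{ \frac{\|y^\delta - y\|}{\|A\|\,\|x\|} + \frac{\|A^\delta - A\|}{\|A\|} \right\}.$$

From this the assertion follows with the aid of $\|A\|\,\|x\| \ge \|y\|$. □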


Theorem 5.3 shows that the condition number may serve as a measure of 
stability for linear operator equations and, in particular, for linear systems. 
A linear system with a small condition number is stable, whereas a large 
condition number indicates instability. We call a linear system with a small 
condition number well-conditioned. Otherwise, it is called ill-conditioned. 
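A small numerical check may make the estimate of Theorem 5.3 concrete. The sketch below is an illustration under assumptions: the $2 \times 2$ matrix, the perturbation, and all helper names are chosen here and are not from the text. Only the right-hand side is perturbed, so the estimate reduces to $\|x^\delta - x\|/\|x\| \le \operatorname{cond}(A)\,\|y^\delta - y\|/\|y\|$, and the maximum norm is used throughout.

```python
from fractions import Fraction

def norm_inf_vec(x):
    return max(abs(t) for t in x)

def norm_inf_mat(A):
    """Maximum absolute row sum, the matrix norm induced by the maximum norm."""
    return max(sum(abs(t) for t in row) for row in A)

def inverse_2x2(A):
    (a, b), (c, d) = A
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

def cond_inf(A):
    return norm_inf_mat(A) * norm_inf_mat(inverse_2x2(A))

def matvec(A, x):
    return [sum(a * t for a, t in zip(row, x)) for row in A]

# Nearly singular matrix, chosen only for illustration.
A = [[Fraction(1), Fraction(1)], [Fraction(1), Fraction(10001, 10000)]]
x = [Fraction(1), Fraction(1)]
y = matvec(A, x)                                  # exact right-hand side
y_delta = [y[0], y[1] + Fraction(1, 10000)]       # perturbed data
x_delta = matvec(inverse_2x2(A), y_delta)

rel_err = norm_inf_vec([u - v for u, v in zip(x_delta, x)]) / norm_inf_vec(x)
rel_data = norm_inf_vec([u - v for u, v in zip(y_delta, y)]) / norm_inf_vec(y)
print(float(cond_inf(A)), float(rel_err), float(cond_inf(A) * rel_data))
```

The three printed numbers are the condition number, the observed relative error in the solution, and the bound; the second should not exceed the third.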

By Theorem 3.31, the condition number of a Hermitian matrix $A$ in the
Euclidean norm is given by

$$\operatorname{cond}_2(A) = \frac{|\lambda_{\max}|}{|\lambda_{\min}|},$$

where $\lambda_{\max}$ and $\lambda_{\min}$ denote the eigenvalues of $A$ with largest and smallest
modulus, respectively. Table 5.3 is obtained by employing the QR algorithm
(see Section 7.4) for the computation of matrix eigenvalues. It illustrates
quantitatively the degree of instability, i.e., the ill-conditionedness of the
linear system from Example 5.1.


TABLE 5.3. Condition number for the linear system (5.1) 


5.2 Singular Value Decomposition


In the sequel we wish to introduce some of the basic concepts for the 
approximate solution of ill-conditioned linear systems. Our approach will 
be based on the singular value decomposition of a matrix A, which need 
not be a square matrix. 

For each $m \times n$ matrix $A$, representing an operator $A : \mathbb{C}^n \to \mathbb{C}^m$, the
$n \times n$ matrix $A^*A$ is Hermitian and positive semidefinite (see Problem 5.9).
Therefore, the eigenvalues of $A^*A$ are real and nonnegative (see Theorem
3.29). The nonnegative square roots of these eigenvalues are called the
singular values of $A$.

For the remainder of this chapter, by $(\cdot\,,\cdot)$ we denote the Euclidean
scalar product in $\mathbb{C}^n$. For an $m \times n$ matrix $A$ of rank $r$, the nullspace
$N(A) = \{x \in \mathbb{C}^n : Ax = 0\}$ has dimension $\dim N(A) = n - r$. We note
that $A^*Au = 0$ implies that

$$\|Au\|_2^2 = (Au, Au) = (u, A^*Au) = 0;$$

i.e., the nullspaces of $A$ and $A^*A$ coincide. Hence $\dim N(A^*A) = n - r$, and
therefore $A$ has exactly $r$ positive singular values (counted according to
their geometric multiplicity, i.e., according to the dimension of the nullspace
of $\mu^2I - A^*A$).


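Theorem 5.4 Let $A$ be an $m \times n$ matrix of rank $r$. Then there exist nonnegative
numbers

$$\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > \mu_{r+1} = \cdots = \mu_n = 0$$

and orthonormal vectors $u_1,\ldots,u_n \in \mathbb{C}^n$ and $v_1,\ldots,v_m \in \mathbb{C}^m$ such that

$$\begin{aligned} Au_j &= \mu_j v_j, \quad A^*v_j = \mu_j u_j, & j &= 1,\ldots,r, \\ Au_j &= 0, & j &= r+1,\ldots,n, \\ A^*v_j &= 0, & j &= r+1,\ldots,m. \end{aligned} \tag{5.5}$$

For each $x \in \mathbb{C}^n$ we have the singular value decomposition

$$Ax = \sum_{j=1}^{r} \mu_j\,(x, u_j)\,v_j. \tag{5.6}$$

Each system $(\mu_j, u_j, v_j)$ with these properties is called a singular system of
the matrix $A$.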

Proof. The Hermitian and positive semidefinite matrix $A^*A$ of rank $r$ has
$n$ orthonormal eigenvectors $u_1,\ldots,u_n$ with nonnegative eigenvalues

$$A^*Au_j = \mu_j^2 u_j, \quad j = 1,\ldots,n, \tag{5.7}$$

which we may assume to be ordered according to $\mu_1 \ge \mu_2 \ge \cdots \ge \mu_r > 0$
and $\mu_{r+1} = \cdots = \mu_n = 0$. We define

$$v_j := \frac{1}{\mu_j}\,Au_j, \quad j = 1,\ldots,r.$$

Then, using (5.7) we have

$$(v_j, v_k) = \frac{1}{\mu_j\mu_k}\,(Au_j, Au_k) = \frac{1}{\mu_j\mu_k}\,(u_j, A^*Au_k) = \delta_{jk}, \quad j,k = 1,\ldots,r,$$

where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \ne j$. Further, we compute that
$A^*v_j = \mu_j u_j$, $j = 1,\ldots,r$, and hence the first line of (5.5) is proven. The
second line of (5.5) is a consequence of $N(A) = N(A^*A)$.

If $r < m$, by the Gram–Schmidt orthogonalization procedure from Theorem
3.18 we can extend $v_1,\ldots,v_r$ to an orthonormal basis $v_1,\ldots,v_m$ of
$\mathbb{C}^m$. Since $A^*$ has rank $r$, we have $\dim N(A^*) = m - r$. From this we can
conclude the third line of (5.5).

Since the $u_1,\ldots,u_n$ form an orthonormal basis of $\mathbb{C}^n$, we can represent

$$x = \sum_{j=1}^{n} (x, u_j)\,u_j,$$

and (5.6) follows by applying $A$ and observing (5.5). □


Clearly, we can rewrite the equations (5.5) in the form

$$A = VDU^*, \tag{5.8}$$

where $U = (u_1,\ldots,u_n)$ and $V = (v_1,\ldots,v_m)$ are unitary $n \times n$ and $m \times m$
matrices, respectively, and where $D$ is an $m \times n$ diagonal matrix with entries
$d_{jj} = \mu_j$ for $j = 1,\ldots,r$ and $d_{jk} = 0$ otherwise.


Theorem 5.5 Let $A$ be an $m \times n$ matrix of rank $r$ with singular system
$(\mu_j, u_j, v_j)$. The linear system

$$Ax = y \tag{5.9}$$

is solvable if and only if

$$(y, z) = 0 \tag{5.10}$$

for all $z \in \mathbb{C}^m$ with $A^*z = 0$. In this case a solution of (5.9) is given by

$$x_0 = \sum_{j=1}^{r} \frac{1}{\mu_j}\,(y, v_j)\,u_j. \tag{5.11}$$


Proof. Let $x$ be a solution of (5.9) and let $A^*z = 0$. Then

$$(y, z) = (Ax, z) = (x, A^*z) = 0.$$

This implies the necessity of condition (5.10) for the solvability of (5.9).
Conversely, assume that (5.10) is satisfied. In terms of the orthonormal
basis $v_1,\ldots,v_m$ of $\mathbb{C}^m$ condition (5.10) implies that

$$y = \sum_{j=1}^{r} (y, v_j)\,v_j, \tag{5.12}$$

since $A^*v_j = 0$ for $j = r+1,\ldots,m$. For the vector $x_0$ defined by (5.11) we
have that

$$Ax_0 = \sum_{j=1}^{r} (y, v_j)\,v_j.$$

In view of (5.12) this implies that $Ax_0 = y$, and the proof is complete. □

Since $N(A) = \operatorname{span}\{u_{r+1},\ldots,u_n\}$, the vector $x_0$ defined by (5.11) has
the property

$$(x_0, x) = 0$$

for all $x \in N(A)$. In the case where equation (5.9) has more than one
solution, the general solution is obtained from (5.11) by adding an arbitrary
solution $x$ of the homogeneous equation $Ax = 0$. Then from

$$\|x_0 + x\|_2^2 = \|x_0\|_2^2 + 2\operatorname{Re}(x_0, x) + \|x\|_2^2 = \|x_0\|_2^2 + \|x\|_2^2$$

we observe that (5.11) represents the uniquely determined solution of (5.9)
with minimal Euclidean norm.
In the case where equation (5.9) has no solution, we represent

$$y = \sum_{j=1}^{m} (y, v_j)\,v_j$$

in terms of the orthonormal basis $v_1,\ldots,v_m$. Let $x_0$ be given by (5.11) and
let $x \in \mathbb{C}^n$ be arbitrary. Then

$$(Ax - Ax_0, Ax_0 - y) = 0,$$

since $Ax - Ax_0 \in \operatorname{span}\{v_1,\ldots,v_r\}$ and $Ax_0 - y \in \operatorname{span}\{v_{r+1},\ldots,v_m\}$. This
implies

$$\|Ax - y\|_2^2 = \|Ax - Ax_0\|_2^2 + \|Ax_0 - y\|_2^2,$$

whence (5.11) represents a least squares solution of (5.9) (see Example 2.4).
Again, it can be shown that (5.11) is the uniquely determined least squares
solution of (5.9) with minimal Euclidean norm (see Problem 5.11).

Hence, (5.11) defines a linear operator $A^\dagger : \mathbb{C}^m \to \mathbb{C}^n$ by

$$A^\dagger y := \sum_{j=1}^{r} \frac{1}{\mu_j}\,(y, v_j)\,u_j, \quad y \in \mathbb{C}^m, \tag{5.13}$$

which of course also allows a representation by an $n \times m$ matrix. Due to
the properties of $A^\dagger y$ as discussed above, this operator or matrix is known
as the pseudo-inverse or Moore–Penrose inverse of $A$ (see [7]). It was first
introduced by Moore in 1920 and independently rediscovered by Penrose
in 1955. For an alternative introduction of $A^\dagger$ see Problem 5.12.
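As a small illustration of (5.13), consider the following worked example; the matrix is chosen here only for illustration and does not appear in the text. For

$$A = \begin{pmatrix} 3 & 0 \\ 4 & 0 \end{pmatrix}, \quad A^*A = \begin{pmatrix} 25 & 0 \\ 0 & 0 \end{pmatrix},$$

we have $r = 1$, $\mu_1 = 5$, $u_1 = (1, 0)^T$, and $v_1 = \frac{1}{5}Au_1 = \left(\frac{3}{5}, \frac{4}{5}\right)^T$. Hence

$$A^\dagger y = \frac{1}{5}\,(y, v_1)\,u_1 = \frac{1}{25}\begin{pmatrix} 3y_1 + 4y_2 \\ 0 \end{pmatrix}, \quad \text{i.e.,} \quad A^\dagger = \frac{1}{25}\begin{pmatrix} 3 & 4 \\ 0 & 0 \end{pmatrix}.$$

For $y = (1, 0)^T$, which does not belong to the range of $A$, this gives $A^\dagger y = \left(\frac{3}{25}, 0\right)^T$, in agreement with minimizing $\|Ax - y\|_2^2 = 25x_1^2 - 6x_1 + 1$ over $x_1$ and choosing $x_2 = 0$ to obtain minimal norm.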

By Theorem 3.31 the condition number of a nonsingular matrix with
respect to the Euclidean norm is given by the quotient of the largest and
smallest singular value. Theorem 5.5 demonstrates the influence of small
singular values on the condition of the matrix $A$. If for some $\delta \in \mathbb{C}$ we
perturb the right-hand side by setting $y^\delta = y + \delta v_j$, we obtain a perturbed
solution $x^\delta = x + \delta u_j/\mu_j$. Hence, the ratio $\|x^\delta - x\|_2/\|y^\delta - y\|_2 = 1/\mu_j$
becomes large if $A$ possesses small singular values.

This observation suggests stabilizing an ill-conditioned linear system by
damping or filtering out the influence of the factor $1/\mu_j$ in the solution
formula (5.11). In the so-called spectral cutoff, the terms in (5.11) corresponding
to small singular values are simply neglected. Of course, this
requires some strategy on how to determine the number of terms being
summed up in (5.11). A very effective strategy is provided by the following
discrepancy principle. If the right-hand side $y$ of a linear system is known
only within an error level $\delta$ then it is quite natural to require $Ax = y$ to be
satisfied only up to the same accuracy $\delta$, since it does not make much sense
to try to satisfy the linear system more accurately than the right-hand side
is known. To describe the discrepancy principle more precisely, given an
erroneous right-hand side $y^\delta$ with known error level $\|y^\delta - y\|_2 \le \delta$, in the
spectral cutoff the solution $x = A^\dagger y$ of $Ax = y$ is approximated by

$$x_p := \sum_{j=1}^{p} \frac{1}{\mu_j}\,(y^\delta, v_j)\,u_j \tag{5.14}$$

for some $0 \le p \le r$. For the following theorem we have to assume that
$Ax = y$ is solvable.

Theorem 5.6 Let $A$ be an $m \times n$ matrix with singular system $(\mu_j, u_j, v_j)$
and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$ satisfy

$$\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$$

for $\delta > 0$. Then there exists a smallest integer $p = p(\delta)$ such that

$$\|Ax_p - y^\delta\|_2 \le \delta. \tag{5.15}$$

This discrepancy principle for the spectral cutoff is regular in the sense that
if the error level $\delta$ tends to zero, then

$$x_p \to A^\dagger y, \quad \delta \to 0. \tag{5.16}$$

Proof. Consider the function $F : \{0,1,\ldots,r\} \to \mathbb{R}$ defined by

$$F(p) := \|Ax_p - y^\delta\|_2^2 - \delta^2.$$

In terms of the singular system, we can write

$$F(p) = \sum_{j=p+1}^{m} |(y^\delta, v_j)|^2 - \delta^2. \tag{5.17}$$

Hence, $F$ is monotonically nonincreasing with $F(0) = \|y^\delta\|_2^2 - \delta^2 > 0$ and
$F(r) = -\delta^2 < 0$ if the rank $r$ of $A$ is equal to $m$. If $r < m$, then using
$(y, v_j) = 0$, $j = r+1,\ldots,m$ (see the proof of Theorem 5.5), we have

$$F(r) = \sum_{j=r+1}^{m} |(y^\delta - y, v_j)|^2 - \delta^2 \le \|y^\delta - y\|_2^2 - \delta^2 \le 0.$$

Therefore, there exists a smallest integer $p = p(\delta)$ such that $F(p) \le 0$. Note
that $p \le r$. In actual computations, this stopping parameter $p$ is determined
by terminating the sum (5.14) when the right-hand side of (5.17) becomes
smaller than or equal to zero for the first time.

In order to show the convergence (5.16), we note that $\|Ax_p - y^\delta\|_2 \le \delta$
implies

$$\|Ax_p - y\|_2 \le \|Ax_p - y^\delta\|_2 + \|y^\delta - y\|_2 \le 2\delta \to 0, \quad \delta \to 0,$$

i.e., $Ax_p \to y$, $\delta \to 0$. From this, since $A^\dagger Av = v$ for all $v \in \operatorname{span}\{u_1,\ldots,u_r\}$,
we finally can conclude that $x_p \to A^\dagger y$, $\delta \to 0$. □
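The stopping rule used in the proof, namely adding terms of (5.14) until the right-hand side of (5.17) is no longer positive, can be sketched as follows. This is an illustration under assumptions: a singular system is taken as given, the data are real, and the function name and argument layout are hypothetical.

```python
def dot(a, b):
    return sum(s * t for s, t in zip(a, b))

def spectral_cutoff(mu, U, V, y_delta, delta):
    """Return (p, x_p) with p chosen by the discrepancy principle (5.15).

    mu: the r positive singular values in decreasing order;
    U: the vectors u_1, ..., u_r; V: the vectors v_1, ..., v_m.
    Real data are assumed in this sketch.
    """
    coeffs = [dot(y_delta, v) for v in V]           # (y^delta, v_j), j = 1..m
    F = sum(c * c for c in coeffs) - delta**2       # F(0) as in (5.17)
    x = [0.0] * len(U[0])
    p = 0
    while F > 0 and p < len(mu):
        x = [xi + coeffs[p] / mu[p] * ui for xi, ui in zip(x, U[p])]
        F -= coeffs[p] ** 2                         # F(p+1) = F(p) - |(y^delta, v_{p+1})|^2
        p += 1
    return p, x

# Illustrative data: A = diag(1, 0.1, 0.001), exact solution (1, 1, 1).
mu = [1.0, 0.1, 0.001]
E = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y_delta = [1.0, 0.1, 0.002]          # third component carries an error of 0.001
p, x_p = spectral_cutoff(mu, E, E, y_delta, delta=0.0025)
print(p, x_p)
```

In this example the sum stops before the smallest singular value is used, so the error in the third data component is not amplified by the factor $1/\mu_3 = 1000$.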


The spectral cutoff method requires the full solution of the eigenvalue 
problem for the matrix A*A, which we will describe in Chapter 7. As an 
alternative, in the following section we shall describe the Tikhonov regu- 
larization, which can be performed without explicitly knowing the singular 
value decomposition. 


5.3 Tikhonov Regularization 


Tikhonov regularization as introduced independently by Phillips in 1962
and Tikhonov in 1963 is obtained from (5.11) by multiplying $1/\mu_j$ by the
damping factor

$$\frac{\mu_j^2}{\alpha + \mu_j^2},$$

where $\alpha$ is some positive regularization parameter.


Theorem 5.7 Let $A$ be an $m \times n$ matrix of rank $r$ with singular system
$(\mu_j, u_j, v_j)$ and let $\alpha > 0$. Then for each $y \in \mathbb{C}^m$ the linear system

$$\alpha x_\alpha + A^*Ax_\alpha = A^*y \tag{5.18}$$

is uniquely solvable, and the solution is given by

$$x_\alpha = \sum_{j=1}^{r} \frac{\mu_j}{\alpha + \mu_j^2}\,(y, v_j)\,u_j. \tag{5.19}$$


Proof. For $\alpha > 0$ the matrix $\alpha I + A^*A$ is positive definite and therefore
nonsingular. Since

$$\alpha u_j + A^*Au_j = (\alpha + \mu_j^2)\,u_j,$$

a singular system for the matrix $\alpha I + A^*A$ is given by $(\alpha + \mu_j^2, u_j, u_j)$,
$j = 1,\ldots,n$. Now the assertion follows from Theorem 5.5 with the aid of
$(A^*y, u_j) = (y, Au_j)$ and using (5.5). □

Corollary 5.8 Under the assumptions of Theorem 5.7 we have convergence:

$$\lim_{\alpha \to 0}\,(\alpha I + A^*A)^{-1}A^*y = A^\dagger y.$$


Proof. This is obvious from (5.13) and (5.19). □
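To see the damping in (5.19) at work, one may look at a small worked example; the matrix below is an illustrative choice and not taken from the text. For

$$A = \begin{pmatrix} 3 & 0 \\ 4 & 0 \end{pmatrix}$$

the only positive singular value is $\mu_1 = 5$ with $u_1 = (1,0)^T$ and $v_1 = \left(\frac{3}{5}, \frac{4}{5}\right)^T$. The system (5.18) reads

$$\begin{pmatrix} \alpha + 25 & 0 \\ 0 & \alpha \end{pmatrix} x_\alpha = \begin{pmatrix} 3y_1 + 4y_2 \\ 0 \end{pmatrix}, \quad \text{so that} \quad x_\alpha = \frac{3y_1 + 4y_2}{\alpha + 25}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{\mu_1}{\alpha + \mu_1^2}\,(y, v_1)\,u_1,$$

and letting $\alpha \to 0$ yields $\frac{1}{25}(3y_1 + 4y_2)\,(1,0)^T = A^\dagger y$, as stated in Corollary 5.8.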


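Before we proceed with a discussion on how to choose the regularization
parameter $\alpha$, we give an interpretation of Tikhonov regularization as a
penalized least squares method.


Theorem 5.9 Let $A$ be an $m \times n$ matrix and let $\alpha > 0$. Then for each
$y \in \mathbb{C}^m$ there exists a unique $x_\alpha \in \mathbb{C}^n$ such that

$$\|Ax_\alpha - y\|_2^2 + \alpha\|x_\alpha\|_2^2 = \inf_{x \in \mathbb{C}^n}\left\{\|Ax - y\|_2^2 + \alpha\|x\|_2^2\right\}. \tag{5.20}$$

The minimizing vector $x_\alpha$ is given by the unique solution of the linear
system (5.18).

Proof. (Compare to the proof of Theorem 3.51.) We first note the relation

$$\begin{aligned} \|Ax - y\|_2^2 + \alpha\|x\|_2^2 = {} & \|Ax_\alpha - y\|_2^2 + \alpha\|x_\alpha\|_2^2 \\ & + 2\operatorname{Re}(x - x_\alpha, \alpha x_\alpha + A^*Ax_\alpha - A^*y) \\ & + \|Ax - Ax_\alpha\|_2^2 + \alpha\|x - x_\alpha\|_2^2, \end{aligned} \tag{5.21}$$

which is valid for all $x, x_\alpha \in \mathbb{C}^n$. From this it is obvious that the solution
$x_\alpha$ of (5.18) satisfies (5.20).
Conversely, let $x_\alpha$ be a solution of (5.20) and assume that

$$\alpha x_\alpha + A^*Ax_\alpha \ne A^*y.$$

Then, setting $z := \alpha x_\alpha + A^*Ax_\alpha - A^*y$, for $x := x_\alpha - \varepsilon z$ with $\varepsilon \in \mathbb{R}$ from
(5.21) we have

$$\|Ax - y\|_2^2 + \alpha\|x\|_2^2 = \|Ax_\alpha - y\|_2^2 + \alpha\|x_\alpha\|_2^2 - 2\varepsilon a + \varepsilon^2 b,$$

where

$$a := \|z\|_2^2 \quad \text{and} \quad b := \|Az\|_2^2 + \alpha\|z\|_2^2$$

are both positive. By choosing $\varepsilon = a/b$ we obtain

$$\|Ax - y\|_2^2 + \alpha\|x\|_2^2 < \|Ax_\alpha - y\|_2^2 + \alpha\|x_\alpha\|_2^2,$$

which contradicts (5.20). □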
The interpretation of Tikhonov regularization through the above Theorem
5.9 indicates that it keeps the residual $\|Ax_\alpha - y\|_2$ small and stabilizes
by preventing $x_\alpha$ from becoming large through the penalty term $\alpha\|x_\alpha\|_2^2$.

From the proof of Theorem 5.7 we know that the eigenvalues of the
Hermitian matrix $\alpha I + A^*A$ are given by $\alpha + \mu_j^2$, $j = 1,\ldots,n$. Hence by
Theorem 3.27 we have that

$$\operatorname{cond}_2(\alpha I + A^*A) = \frac{\alpha + \mu_1^2}{\alpha + \mu_n^2} \le \frac{2\mu_1^2}{\alpha}, \quad 0 < \alpha \le \mu_1^2. \tag{5.22}$$

Therefore stability of the linear system (5.18) requires the regularization
parameter $\alpha$ to be fairly large. On the other hand, in order to keep the
system (5.18) reasonably close to the original system $Ax = y$, we expect
that $\alpha$ needs to be small. This observation is made more precise through the
following considerations on the error occurring in Tikhonov regularization.


FIGURE 5.1. Total error for Tikhonov regularization (the errors $E_{\text{total}}$, $E_{\text{approx}}$, and $E_{\text{data}}$ plotted against $\alpha$)


Given an erroneous right-hand side $y^\delta$ with error level $\|y^\delta - y\|_2 \le \delta$, the
Tikhonov regularization approximates the solution $x = A^\dagger y$ of $Ax = y$ by
the solution $x_\alpha$ of the regularized linear system

$$\alpha x_\alpha + A^*Ax_\alpha = A^*y^\delta. \tag{5.23}$$

Then, for the total error, writing

$$x_\alpha - x = (\alpha I + A^*A)^{-1}A^*(y^\delta - y) + (\alpha I + A^*A)^{-1}A^*y - A^\dagger y,$$

by the triangle inequality we have the estimate

$$\|x_\alpha - x\|_2 \le \|(\alpha I + A^*A)^{-1}A^*\|_2\,\delta + \|(\alpha I + A^*A)^{-1}A^*y - A^\dagger y\|_2.$$

This decomposition shows that the total error consists of two parts:

$$E_{\text{total}} \le E_{\text{data}} + E_{\text{approx}}.$$

The first term, with the aid of Theorem 3.31, can be estimated by

$$E_{\text{data}} = \|(\alpha I + A^*A)^{-1}A^*\|_2\,\delta \ge \frac{\mu_r}{\alpha + \mu_r^2}\,\delta.$$

It reflects the influence of the incorrect data and, for fixed $\delta$, becomes large
as $\alpha \to 0$, if the smallest positive singular value $\mu_r$ is close to zero (see also
Problem 5.16). The second term,

$$E_{\text{approx}} = \|(\alpha I + A^*A)^{-1}A^*y - A^\dagger y\|_2,$$

describes the approximation error due to the replacement of $Ax = y$ by the
regularized equation (5.23), and by Corollary 5.8, it goes to zero as $\alpha \to 0$.
This error behavior is illustrated in Figure 5.1.

On one hand, in view of (5.22) the stability of the system requires a large
regularization parameter $\alpha$ to keep $E_{\text{data}}$ small, i.e., to keep the influence
of the data error $\|y^\delta - y\|_2$ small. On the other hand, keeping $E_{\text{approx}}$ small
asks for a small parameter $\alpha$.

Obviously, the choice of the parameter $\alpha$ has to be made through a
compromise between accuracy and stability. An efficient strategy to achieve
this is again provided by the discrepancy principle. In the following theorem
we need to assume that $Ax = y$ is solvable.


Theorem 5.10 Let $A$ be an $m \times n$ matrix and let $y \in A(\mathbb{C}^n)$, $y^\delta \in \mathbb{C}^m$
satisfy

$$\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$$

for $\delta > 0$. Then there exists a unique $\alpha = \alpha(\delta) > 0$ such that the unique
solution $x_\alpha$ of (5.23) satisfies

$$\|Ax_\alpha - y^\delta\|_2 = \delta. \tag{5.24}$$

This discrepancy principle for Tikhonov regularization is regular in the
sense that if the error level $\delta$ tends to zero, then

$$x_\alpha \to A^\dagger y, \quad \delta \to 0. \tag{5.25}$$

Proof. We have to show that the function $F : (0,\infty) \to \mathbb{R}$ defined by

$$F(\alpha) := \|Ax_\alpha - y^\delta\|_2^2 - \delta^2$$

has a unique zero. In terms of a singular system, from the representation
(5.19) we find that

$$F(\alpha) = \sum_{j=1}^{m} \frac{\alpha^2}{(\alpha + \mu_j^2)^2}\,|(y^\delta, v_j)|^2 - \delta^2.$$

Therefore, $F$ is continuous and strictly monotonically increasing with the
limits $F(\alpha) \to -\delta^2 < 0$, $\alpha \to 0$, and $F(\alpha) \to \|y^\delta\|_2^2 - \delta^2 > 0$, $\alpha \to \infty$.
Hence, $F$ has exactly one zero $\alpha = \alpha(\delta)$.

Note that the condition $\|y^\delta - y\|_2 \le \delta < \|y^\delta\|_2$ implies that $y \ne 0$. Using
(5.23), (5.24), and the triangle inequality we can estimate

$$\|y^\delta\|_2 - \delta = \|y^\delta\|_2 - \|Ax_\alpha - y^\delta\|_2 \le \|Ax_\alpha\|_2$$

and

$$\alpha\|Ax_\alpha\|_2 = \|AA^*(y^\delta - Ax_\alpha)\|_2 \le \|AA^*\|_2\,\delta.$$

Combining these two inequalities and using $\|y^\delta\|_2 \ge \|y\|_2 - \delta$ yields

$$\alpha \le \frac{\|AA^*\|_2\,\delta}{\|y\|_2 - 2\delta}.$$

This implies that $\alpha \to 0$, $\delta \to 0$. Now the convergence (5.25) follows from
the representations (5.13) for $A^\dagger y$ and (5.19) for $x_\alpha$ (with $y$ replaced by
$y^\delta$) and the fact that $\|y^\delta - y\|_2 \to 0$, $\delta \to 0$. □


In practice, of course, one does not need to determine the regularization
parameter satisfying (5.24) exactly. Usually the following strategy will be
sufficient: Choose some moderately sized $\alpha$ and then keep decreasing $\alpha$ by
a constant factor $\gamma$, say $\gamma = 0.5$, until $F(\alpha)$ becomes negative.
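The strategy just described can be sketched in Python for real matrices. This is a minimal illustration under assumptions and not the program behind Table 5.4: the starting value `alpha=1.0`, the safeguard `max_halvings`, the test data, and all function names are hypothetical choices made here.

```python
def solve(M, b):
    """Gaussian elimination with partial pivoting for a small dense real system."""
    n = len(b)
    T = [row[:] + [bi] for row, bi in zip(M, b)]
    for i in range(n):
        p = max(range(i, n), key=lambda r: abs(T[r][i]))
        T[i], T[p] = T[p], T[i]
        for r in range(i + 1, n):
            f = T[r][i] / T[i][i]
            T[r] = [a - f * c for a, c in zip(T[r], T[i])]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (T[i][n] - sum(T[i][k] * x[k] for k in range(i + 1, n))) / T[i][i]
    return x

def tikhonov(A, y, alpha):
    """Solve (alpha I + A^T A) x = A^T y, i.e. (5.18) for a real m x n matrix A."""
    m, n = len(A), len(A[0])
    AtA = [[sum(A[i][j] * A[i][k] for i in range(m)) + (alpha if j == k else 0.0)
            for k in range(n)] for j in range(n)]
    Aty = [sum(A[i][j] * y[i] for i in range(m)) for j in range(n)]
    return solve(AtA, Aty)

def discrepancy(A, x, y_delta, delta):
    """F(alpha) = ||A x - y^delta||_2^2 - delta^2 evaluated at x = x_alpha."""
    r = [sum(a * t for a, t in zip(row, x)) - yi for row, yi in zip(A, y_delta)]
    return sum(t * t for t in r) - delta**2

def choose_alpha(A, y_delta, delta, alpha=1.0, gamma=0.5, max_halvings=200):
    """Decrease alpha by the factor gamma until F(alpha) becomes negative."""
    for _ in range(max_halvings):
        x = tikhonov(A, y_delta, alpha)
        if discrepancy(A, x, y_delta, delta) < 0:
            return alpha, x
        alpha *= gamma
    raise RuntimeError("no admissible alpha found; check delta and the data")

# Illustrative data: A = diag(1, 0.01), exact data (1, 0.01), exact solution (1, 1).
A = [[1.0, 0.0], [0.0, 0.01]]
y_delta = [1.0, 0.02]
alpha, x_alpha = choose_alpha(A, y_delta, delta=0.015)
print(alpha, x_alpha)
```

For these data the unregularized solution would be $(1, 2)^T$, whereas the regularized second component is damped. The loop returns the first $\alpha$ in the sequence $1, \gamma, \gamma^2, \ldots$ with $F(\alpha) < 0$, so the residual is at most $\delta$ without solving (5.24) exactly.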

In order to illustrate that Tikhonov regularization works, Table 5.4 gives 
some numerical results for the linear system of Example 5.1 with the erroneous
right-hand side generated by using $\ln 2 = 0.69315$ and choosing the
regularizing parameter a = 107! (without attempting to use Theorem 
5.10). 


TABLE 5.4. Regularized solution of the linear system (5.1) 


Problems 


5.1 For the condition number of linear operators show that

$$\operatorname{cond}(AB) \le \operatorname{cond}(A)\,\operatorname{cond}(B).$$

5.2 Let $A$ be an $n \times n$ matrix and $Q$ be a unitary $n \times n$ matrix. Show that

$$\operatorname{cond}_2(QA) = \operatorname{cond}_2(A)$$

and

$$\operatorname{cond}_2(A^*A) \ge \operatorname{cond}_2(A).$$


5.3 Determine $\operatorname{cond}_2(A)$ for the matrix $A$ of Example 2.1 and discuss its behavior
for large $n$.

5.4 Find the inverse of the matrix


and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.


5.5 Find the inverse of the matrix 


100 1 4 9Q 
1 10 5 -!1 


and find the condition numbers $\operatorname{cond}_p(A)$ for $p = 1, 2, \infty$.


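5.6 Calculate $\operatorname{cond}_\infty(A)$ for the matrix

$$A = \begin{pmatrix} 1 & 1 & 1 \\ 1 & 10 & 100 \\ 1 & 100 & 10000 \end{pmatrix}.$$

Show that one can improve the condition of a matrix by scaling through calculating
$\operatorname{cond}_\infty(DA)$ where $D$ is the diagonal matrix

$$D = \operatorname{diag}(1/3,\ 1/111,\ 1/10101).$$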


5.7 Let $A = (a_{jk})$ be an $n \times n$ matrix satisfying

$$\sum_{k=1}^{n} |a_{jk}| = 1, \quad j = 1,\ldots,n.$$

Show that

$$\operatorname{cond}_\infty(A) \le \operatorname{cond}_\infty(DA)$$

for all $n \times n$ diagonal matrices $D$ (see Problem 5.6).

5.8 For a nonsingular matrix $A$ show that

$$\frac{1}{\operatorname{cond}(A)} = \frac{1}{\|A\|}\,\min\{\|B\| : A + B \text{ is singular}\}.$$

This indicates that if a nonsingular matrix has a large condition number, it is
close to a singular matrix.


5.9 Show that for an m x n matrix A the n x n matrix A*A is Hermitian and 
positive semidefinite. 


5.10 Find the singular value decomposition of

$$A = \begin{pmatrix} 1 & 0 & 1 & 1 \\ 1 & 0 & -1 & 0 \\ 1 & 1 & 0 & 1 \end{pmatrix}.$$

5.11 Show that $A^\dagger y$ is the least squares solution of $Ax = y$ with minimal norm.

5.12 Show that the pseudo-inverse $A^\dagger$ is uniquely determined by the properties

$$AA^\dagger = (AA^\dagger)^*, \quad A^\dagger A = (A^\dagger A)^*, \quad AA^\dagger A = A, \quad A^\dagger AA^\dagger = A^\dagger.$$

Express the pseudo-inverse in terms of the decomposition (5.8).

5.13 For the pseudo-inverse show that $(A^\dagger)^\dagger = A$ and $(A^\dagger)^* = (A^*)^\dagger$.

5.14 Give an example to show that in general, $(AB)^\dagger \ne B^\dagger A^\dagger$.


5.15 What is the pseudo-inverse of $A : \mathbb{C}^n \to \mathbb{C}^m$ given by $Ax = (x, a)\,b$ with
$a \in \mathbb{C}^n$ and $b \in \mathbb{C}^m$?


5.16 For an $m \times n$ matrix $A$ show that

$$\|(\alpha I + A^*A)^{-1}A^*\|_2 \le \frac{1}{2\sqrt{\alpha}}$$

for $\alpha > 0$.


5.17 Give an alternative proof of Theorem 5.9 by using the necessary and suf- 
ficient conditions for the minimum of a function of n variables. 


5.18 Let X and Y be finite-dimensional pre-Hilbert spaces and let $A : X \to Y$ be a linear operator. Show that there exists a uniquely determined linear operator $A^* : Y \to X$ with the property
$$(Ax, y)_Y = (x, A^*y)_X$$
for all $x \in X$ and $y \in Y$. Use this result to formulate and prove a generalization of Theorem 5.9 for the minimization of
$$\|Ax - y\|_Y^2 + \alpha \|x\|_X^2.$$

5.19 Show that 
(x,y) := So ay95 + D_(@j — 25-1) (9) - j-1) 
j=0 j=l 


defines a scalar product on €”. Discuss its use in Tikhonov regularization as 
indicated in Problem 5.18, where in addition to large components of the solution 
vector oscillations between consecutive components are also penalized. 


5.20 Show that $A : C[0,1] \to C[0,1]$ defined by
$$(Af)(x) := \int_0^x f(y)\, dy, \quad x \in [0,1],$$
is a bounded linear operator that does not have a bounded inverse; i.e., show that differentiation is an ill-posed problem.


6 


Iterative Methods for 
Nonlinear Systems 


In this chapter we will study the solution of systems of nonlinear equa- 
tions. As opposed to linear equations, no explicit solution techniques are, 
in general, available for nonlinear equations, and hence their solution com- 
pletely relies on iterative methods. In the first section we shall begin with 
the application of the Banach fixed point theorem for systems of nonlin- 
ear equations with one or several variables. Given the fact that iterative 
techniques have a long history in mathematics, the significance of Banach’s 
fixed point theorem originates from its unified approach, covering a wide 
variety of different successive approximation methods. 

In the second section, we will continue with the study of Newton’s it- 
eration method for finding zeros of functions of one or several variables. 
This iteration scheme is attributed to Newton, since in 1669 he developed 
a solution method for cubic equations by linearization that may be viewed 
as a precursor of what is now known as Newton iteration. He also used this 
method for approximately solving Kepler’s equations for planetary motion. 

In the concluding two sections of this chapter we will consider the appli- 
cation of Newton’s method for finding zeros of polynomials and its modifi- 
cation into the more recently developed Levenberg—Marquardt scheme for 
solving the least squares problem. 

Given the vast number of iterative methods available for nonlinear equations, we will confine our presentation to describing the fundamental ideas and will not aim at a complete treatment of the subject.


6.1 Successive Approximations


In this section, we will consider systems of n nonlinear equations for n unknowns of the form
$$f(x) = x,$$
where $x = (x_1, \ldots, x_n)^T$ and $f(x) = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))^T$. We begin by studying the case of a single nonlinear equation with one unknown. Obviously, in one dimension, solving f(x) = x geometrically corresponds to determining the intersection of the graph of the function f with the straight line described by the function $x \mapsto x$.


Theorem 6.1 Let $D \subset \mathbb{R}$ be a closed interval and let $f : D \to D$ be a continuously differentiable function with the property
$$q := \sup_{x \in D} |f'(x)| < 1.$$
Then the equation f(x) = x has a unique solution $x \in D$, and the successive approximations
$$x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \ldots,$$
with arbitrary $x_0 \in D$ converge to this solution. We have the a priori error estimate
$$|x_\nu - x| \leq \frac{q^\nu}{1-q}\, |x_1 - x_0|$$
and the a posteriori error estimate
$$|x_\nu - x| \leq \frac{q}{1-q}\, |x_\nu - x_{\nu-1}|$$
for all $\nu \in \mathbb{N}$.

Proof. Equipped with the norm $\|\cdot\| = |\cdot|$ the space $\mathbb{R}$ is complete. By the mean value theorem, for $x, y \in D$ with x < y, we have that
$$f(x) - f(y) = f'(\xi)(x - y)$$
for some intermediate point $\xi \in (x, y)$. Hence
$$|f(x) - f(y)| \leq \sup_{z \in D} |f'(z)|\, |x - y| = q\, |x - y|,$$
which is also valid for $x, y \in D$ with $x \geq y$. Therefore, f is a contraction, and the assertion follows from the Banach fixed point Theorem 3.46. □
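
The two error estimates of Theorem 6.1 translate directly into a stopping rule and an iteration count. The following Python sketch is an illustration only: the function name `fixed_point`, the tolerance, and the choice of f = cos on [0, 1] with q = sin 1 are our assumptions. It stops as soon as the a posteriori bound falls below the tolerance and compares the number of steps with the count predicted by the a priori bound.

```python
import math

def fixed_point(f, x0, q, tol=1e-10, max_iter=1000):
    """Iterate x_{v+1} = f(x_v) until the a posteriori bound
    q/(1-q) * |x_v - x_{v-1}| drops below tol. Returns (x_v, v)."""
    x_prev = x0
    for step in range(1, max_iter + 1):
        x = f(x_prev)
        if q / (1.0 - q) * abs(x - x_prev) <= tol:
            return x, step
        x_prev = x
    raise RuntimeError("no convergence within max_iter steps")

# Assumed test case: f = cos on D = [0, 1], where q = sup |f'| = sin 1.
q = math.sin(1.0)
x, steps = fixed_point(math.cos, 1.0, q)

# A priori prediction: smallest v with q^v/(1-q) * |x_1 - x_0| <= tol.
x1 = math.cos(1.0)
predicted = math.ceil(math.log(1e-10 * (1.0 - q) / abs(x1 - 1.0)) / math.log(q))
print(x, steps, predicted)
```

Since the a posteriori bound is never larger than the a priori bound, the observed step count cannot exceed the predicted one; in practice it is often considerably smaller.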


Figure 6.1 illustrates graphically the successive approximations for functions f with positive and negative slope, respectively, of absolute value less than one. Note that the sequence $(x_\nu)$ converges to the fixed point monotonically if f has positive slope and that it converges with values alternating above and below the fixed point if f has negative slope. In both cases the slope of the function f has absolute value less than one in a neighborhood of the fixed point. From drawing a corresponding figure for a function with a slope of absolute value greater than one it can be seen that the corresponding iteration will move away from the fixed point (see Problem 6.2).




FIGURE 6.1. Fixed point iteration 


The following theorem states that for a fixed point x with $|f'(x)| < 1$ we always can find starting points $x_0$ ensuring convergence of the successive approximations.


Theorem 6.2 Let x be a fixed point of a continuously differentiable function f such that $|f'(x)| < 1$. Then the method of successive approximations $x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there exists a neighborhood B of the fixed point x such that the successive approximations converge to x for all $x_0 \in B$.


Proof. Since f' is continuous and $|f'(x)| < 1$, there exist constants 0 < q < 1 and $\delta > 0$ such that $|f'(y)| \leq q$ for all $y \in B := [x - \delta, x + \delta]$. Then we have that
$$|f(y) - x| = |f(y) - f(x)| \leq q\, |y - x| \leq |y - x| \leq \delta$$
for all $y \in B$; i.e., f maps B into itself and is a contraction $f : B \to B$. Now the statement of the theorem follows from Theorem 6.1. □


Theorem 6.2 expresses the fact that for a fixed point x with $|f'(x)| < 1$ the sequence $x_{\nu+1} := f(x_\nu)$ converges if the starting point $x_0$ is sufficiently close to x. In practical situations the problem of how to obtain such a good
initial guess is unresolved in general. Frequently, however, a good estimate 
of the fixed point might be known a priori from the underlying application 
or might be deduced from analytic observations. 

The following examples illustrate that in some cases we also have global convergence, where the successive approximations converge for each starting point in the domain of definition of the function f.

Example 6.3 In order to describe a division by iteration, for a > 0 we consider the function $f : \mathbb{R} \to \mathbb{R}$ given by $f(x) := 2x - ax^2$. The graph of this function is a parabola with maximum value 1/a attained at 1/a. By solving the quadratic equation f(x) = x it can be seen that f has the fixed points x = 0 and x = 1/a. Obviously, f maps the open interval (0, 2/a) into (0, 1/a]. Since $f'(x) = 2(1 - ax)$, we have f'(0) = 2 and f'(1/a) = 0. From the property x < f(x) < 1/a, which is valid for 0 < x < 1/a, it follows that the sequence $x_{\nu+1} := 2x_\nu - a x_\nu^2$ is monotonically increasing and bounded. Hence, the successive approximations converge to the fixed point x = 1/a for arbitrarily chosen $x_0 \in (0, 2/a)$. Figure 6.2 illustrates the convergence. The numerical results are for a = 2 and two different starting points, $x_0 = 0.3$ and $x_0 = 0.4$. □


 ν | x_0 = 0.3  | x_0 = 0.4
 0 | 0.30000000 | 0.40000000
 1 | 0.42000000 | 0.48000000
 2 | 0.48720000 | 0.49920000
 3 | 0.49967232 | 0.49999872


FIGURE 6.2. Division by iteration
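
The iteration of Example 6.3 computes the reciprocal 1/a using only multiplications and subtractions. The short Python sketch below (the function name `reciprocal` is ours) reproduces the two columns of numerical values for a = 2. The rapid convergence can be read off from the identity $1/a - x_{\nu+1} = a\,(1/a - x_\nu)^2$, which follows by expanding the square.

```python
def reciprocal(a, x0, steps):
    """Return the iterates x_0, ..., x_steps of x_{v+1} = 2 x_v - a x_v^2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(2.0 * x - a * x * x)   # no division is performed
    return xs

# The two starting points used in Example 6.3, with a = 2.
for x0 in (0.3, 0.4):
    print([f"{x:.8f}" for x in reciprocal(2.0, x0, 3)])
```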


Example 6.4 For computing the square root of a positive real number a by an iterative method we consider the function $f : (0, \infty) \to (0, \infty)$ given by
$$f(x) := \frac{1}{2}\left(x + \frac{a}{x}\right).$$
By solving the quadratic equation f(x) = x it can be seen that f has the fixed point $x = \sqrt{a}$. By the arithmetic geometric mean inequality we have that $f(x) \geq \sqrt{a}$ for x > 0; i.e., f maps the open interval $(0, \infty)$ into $[\sqrt{a}, \infty)$, and therefore it maps the closed interval $[\sqrt{a}, \infty)$ into itself. From
$$f'(x) = \frac{1}{2}\left(1 - \frac{a}{x^2}\right)$$
it follows that
$$q := \sup_{\sqrt{a} \leq x < \infty} |f'(x)| = \frac{1}{2}.$$
Hence $f : [\sqrt{a}, \infty) \to [\sqrt{a}, \infty)$ is a contraction. Therefore, by Theorem 6.1 the successive approximations
$$x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right), \quad \nu = 0, 1, \ldots,$$
converge to the square root $\sqrt{a}$ for each $x_0 > 0$, and we have the a posteriori error estimate
$$|\sqrt{a} - x_\nu| \leq |x_\nu - x_{\nu-1}|.$$
Figure 6.3 illustrates the convergence. The numerical results again are for a = 2. □


 ν | x_ν
 0 | 5.00000000
 1 | 2.70000000
 2 | 1.72037037
 3 | 1.44145537
 4 | 1.41447098
 5 | 1.41421359
 6 | 1.41421356


FIGURE 6.3. Square root by iteration
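
As a quick check of Example 6.4, the following Python sketch (the function name `heron` is ours) regenerates the iterates for a = 2 and $x_0 = 5$ and prints, next to each iterate, the a posteriori bound $|x_\nu - x_{\nu-1}|$ that results from q = 1/2.

```python
def heron(a, x0, steps):
    """Return the iterates x_0, ..., x_steps of x_{v+1} = (x_v + a/x_v)/2."""
    xs = [x0]
    for _ in range(steps):
        x = xs[-1]
        xs.append(0.5 * (x + a / x))
    return xs

xs = heron(2.0, 5.0, 6)
for v in range(1, len(xs)):
    # A posteriori bound with q = 1/2: |sqrt(a) - x_v| <= |x_v - x_{v-1}|.
    print(v, f"{xs[v]:.8f}", f"bound {abs(xs[v] - xs[v - 1]):.2e}")
```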


In both of Examples 6.3 and 6.4 the numerical values exhibit a very 
rapid convergence. This is due to the fact that because of f’(x) = 0 at the 
fixed point, the contraction number is very small. We shall elaborate on 
this observation later when we consider Newton’s method. 


Example 6.5 Consider the function $f : [0,1] \to [0,1]$ given by
$$f(x) := \cos x.$$
Here we have
$$q = \sup_{0 \leq x \leq 1} |f'(x)| = \sin 1 < 1,$$
and Theorem 6.1 implies that the successive approximations $x_{\nu+1} := \cos x_\nu$ converge to the unique solution x of cos x = x for each $x_0 \in [0,1]$. Table 6.1 illustrates the convergence, which is notably slower than in the two previous examples. □


TABLE 6.1. Iterations for Example 6.5

 ν | x_ν
 0 | 1.00000000
 1 | 0.54030231
 2 | 0.85755322
 3 | 0.65428979
 4 | 0.79348036
 5 | 0.70136877
 6 | 0.76395968
 7 | 0.72210243
 ⋮ | ⋮
   | 0.73908513
   | 0.73908514
   | 0.73908513
   | 0.73908513


By the following example we illustrate how to obtain a fixed point of 
a function with derivative greater than one by working with the inverse 
function. 


Example 6.6 The function $h : (0, \infty) \to (-\infty, \infty)$ given by $h(x) := x + \ln x$ is strictly monotonically increasing with limits $\lim_{x \to 0} h(x) = -\infty$ and $\lim_{x \to \infty} h(x) = \infty$. Therefore, the function $f(x) := -\ln x$ has a unique fixed point x. Since this fixed point must satisfy 0 < x < 1, the derivative
$$|f'(x)| = \frac{1}{x} > 1$$
implies that f is not contracting in a neighborhood of the fixed point. However, we can still design a convergent scheme because $x = -\ln x$ is equivalent to $e^{-x} = x$. We consider the inverse function
$$g(x) := e^{-x}$$
of f, which has derivative $|g'(x)| = e^{-x} < 1$ at the fixed point, so that we can apply Theorem 6.2. Obviously, for each 0 < a < 1/e the exponential function g maps the interval [a, 1] into itself. Since
$$q = \sup_{a \leq x \leq 1} |g'(x)| = e^{-a} < 1,$$
by Theorem 6.1 it follows that for arbitrary $x_0 > 0$ the successive approximations $x_{\nu+1} = e^{-x_\nu}$ converge to the unique solution of $x = e^{-x}$. □
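
A small numerical experiment makes the contrast in Example 6.6 concrete. In the Python sketch below (the helper `iterate`, the step count 60, and the perturbation size are our choices) the iteration with $g(x) = e^{-x}$ settles on the fixed point, while a single step with $f(x) = -\ln x$ from a slightly perturbed point enlarges the error by a factor of about 1/x > 1.

```python
import math

def iterate(g, x0, steps):
    """Apply x -> g(x) a fixed number of times."""
    x = x0
    for _ in range(steps):
        x = g(x)
    return x

# Convergent scheme: x_{v+1} = exp(-x_v).
x_star = iterate(lambda t: math.exp(-t), 1.0, 60)
print(f"{x_star:.8f}")

# One step of x -> -ln x from a perturbed point: the error grows.
eps = 1e-6
growth = abs(-math.log(x_star + eps) - x_star) / eps
print(growth)   # close to 1/x_star, which exceeds one
```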


Now we will extend Theorem 6.1 to systems of nonlinear equations. A subset D of a linear space X is called convex if
$$\lambda x + (1 - \lambda) y \in D$$
for all $x, y \in D$ and all $\lambda \in (0, 1)$, i.e., if the straight line connecting x and y is contained in D.


Theorem 6.7 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be a mapping
$$f(x) = (f_1(x_1, \ldots, x_n), \ldots, f_n(x_1, \ldots, x_n))^T,$$
where the $f_j : D \to \mathbb{R}$, $j = 1, \ldots, n$, are continuously differentiable functions. By
$$f'(x) = \left( \frac{\partial f_j}{\partial x_k}(x) \right)_{j,k = 1, \ldots, n}$$
we denote the Jacobian matrix of f. Then we have the mean value theorem
$$\|f(x) - f(y)\| \leq \max_{0 \leq \lambda \leq 1} \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\|$$
for all $x, y \in D$ (and all norms $\|\cdot\|$ on $\mathbb{R}^n$).


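Proof. Let $g : [0,1] \to \mathbb{R}^n$ be continuous. We will show that
$$\left\| \int_0^1 g(\lambda)\, d\lambda \right\| \leq \int_0^1 \|g(\lambda)\|\, d\lambda, \tag{6.1}$$
where the integral on the left-hand side has to be understood as the vector of the integrals over the components of g. The function $\lambda \mapsto \|g(\lambda)\|$ is continuous, since the norm is a continuous function. Therefore, the integral on the right-hand side of (6.1) is well-defined. Consider the equidistant subdivision $\lambda_i = i/m$, $i = 0, 1, \ldots, m$, for $m \in \mathbb{N}$. Then we have the converging Riemann sums
$$\sum_{i=1}^{m} \|g(\lambda_i)\|\, (\lambda_i - \lambda_{i-1}) \to \int_0^1 \|g(\lambda)\|\, d\lambda, \quad m \to \infty,$$
and
$$\sum_{i=1}^{m} g(\lambda_i)\, (\lambda_i - \lambda_{i-1}) \to \int_0^1 g(\lambda)\, d\lambda, \quad m \to \infty.$$
From the second limit, by the continuity of the norm we conclude that
$$\left\| \sum_{i=1}^{m} g(\lambda_i)\, (\lambda_i - \lambda_{i-1}) \right\| \to \left\| \int_0^1 g(\lambda)\, d\lambda \right\|, \quad m \to \infty.$$
Now (6.1) follows by passing to the limit $m \to \infty$ in the inequality
$$\left\| \sum_{i=1}^{m} g(\lambda_i)\, (\lambda_i - \lambda_{i-1}) \right\| \leq \sum_{i=1}^{m} \|g(\lambda_i)\|\, (\lambda_i - \lambda_{i-1}),$$
which is a consequence of the triangle inequality.

Since D is convex, for all $x, y \in D$ we have that
$$f_j(x) - f_j(y) = \int_0^1 \frac{d}{d\lambda}\, f_j[\lambda x + (1 - \lambda) y]\, d\lambda, \quad j = 1, \ldots, n.$$
By the chain rule we compute
$$\frac{d}{d\lambda}\, f_j[\lambda x + (1 - \lambda) y] = \sum_{k=1}^{n} \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda) y]\, (x_k - y_k),$$
and therefore
$$f_j(x) - f_j(y) = \int_0^1 \sum_{k=1}^{n} \frac{\partial f_j}{\partial x_k}[\lambda x + (1 - \lambda) y]\, (x_k - y_k)\, d\lambda;$$
i.e., in vector form,
$$f(x) - f(y) = \int_0^1 f'[\lambda x + (1 - \lambda) y]\, (x - y)\, d\lambda.$$
From this, with the aid of (6.1) and the continuity of $\lambda \mapsto f'[\lambda x + (1 - \lambda) y]$, we obtain
$$\|f(x) - f(y)\| \leq \int_0^1 \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\|\, d\lambda \leq \max_{0 \leq \lambda \leq 1} \|f'[\lambda x + (1 - \lambda) y]\|\, \|x - y\|,$$
which ends the proof. □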


Theorem 6.8 Let $D \subset \mathbb{R}^n$ be closed and convex (with a nonempty interior) and let $f : D \to D$ be a continuous mapping. Assume further that f is continuously differentiable in the interior of D and that its Jacobian can be continuously extended to all of D such that
$$q := \sup_{x \in D} \|f'(x)\| < 1$$
in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the equation f(x) = x has a unique solution $x \in D$, and the successive approximations
$$x_{\nu+1} := f(x_\nu), \quad \nu = 0, 1, 2, \ldots,$$
converge for each $x_0 \in D$ to this fixed point. We have the a priori error estimate
$$\|x_\nu - x\| \leq \frac{q^\nu}{1-q}\, \|x_1 - x_0\|$$
and the a posteriori error estimate
$$\|x_\nu - x\| \leq \frac{q}{1-q}\, \|x_\nu - x_{\nu-1}\|$$
for all $\nu \in \mathbb{N}$.


Proof. By the mean value Theorem 6.7 the mapping $f : D \to D$ is a contraction. □

By Theorem 3.26 we have that each of the conditions
$$\sup_{x \in D}\, \max_{j = 1, \ldots, n}\, \sum_{k=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right| < 1,$$
$$\sup_{x \in D}\, \max_{k = 1, \ldots, n}\, \sum_{j=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right| < 1,$$
$$\sup_{x \in D} \left[ \sum_{j,k=1}^{n} \left| \frac{\partial f_j}{\partial x_k}(x) \right|^2 \right]^{1/2} < 1$$
ensures convergence of the successive approximations in Theorem 6.8.
The following local convergence theorem can be proven analogously to 
Theorem 6.2. 


Theorem 6.9 Let x be a fixed point of a continuously differentiable function f such that $\|f'(x)\| < 1$ in some norm $\|\cdot\|$ on $\mathbb{R}^n$. Then the method of successive approximations $x_{\nu+1} := f(x_\nu)$ is locally convergent; i.e., there exists a neighborhood B of the fixed point x such that the successive approximations converge to x for all starting elements $x_0 \in B$.


Example 6.10 For the system
$$x_1 = 0.5 \cos x_1 - 0.5 \sin x_2,$$
$$x_2 = 0.5 \sin x_1 + 0.5 \cos x_2$$
we have
$$f'(x) = \begin{pmatrix} -0.5 \sin x_1 & -0.5 \cos x_2 \\ 0.5 \cos x_1 & -0.5 \sin x_2 \end{pmatrix}$$
and therefore $\|f'(x)\|_2 \leq \sqrt{0.5}$ for all $x \in \mathbb{R}^2$. Hence Theorem 6.8 is applicable. □
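
The system of Example 6.10 can be iterated directly. The following Python sketch is a minimal illustration under our own choices (starting point (0, 0), tolerance, and helper names `f` and `norm2`); it uses the a posteriori estimate of Theorem 6.8 with $q = \sqrt{0.5}$ in the Euclidean norm as stopping criterion.

```python
import math

def f(x):
    """Right-hand side of the system in Example 6.10."""
    x1, x2 = x
    return (0.5 * math.cos(x1) - 0.5 * math.sin(x2),
            0.5 * math.sin(x1) + 0.5 * math.cos(x2))

def norm2(v):
    """Euclidean norm of a vector in R^2."""
    return math.hypot(v[0], v[1])

q = math.sqrt(0.5)
x = (0.0, 0.0)
for step in range(1, 200):
    x_new = f(x)
    # A posteriori bound for the error of x_new.
    bound = q / (1.0 - q) * norm2((x_new[0] - x[0], x_new[1] - x[1]))
    x = x_new
    if bound <= 1e-12:
        break
print(step, x)
```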


The reader will not be surprised to learn that for speeding up convergence 
of the successive approximations, concepts developed for linear equations 
like relaxation methods or multigrid methods can also be successfully em- 
ployed in the nonlinear case. However, since we discussed these methods 
in some detail in Sections 4.2 and 4.3 for linear equations, we shall refrain 
from repeating the analysis for nonlinear equations. 


6.2 Newton’s Method 


We now want to determine zeros of a function of n variables; i.e., we want to solve equations of the form
$$f(x) = 0,$$
where $f : D \to \mathbb{R}^n$ is a continuously differentiable function defined on some open subset $D \subset \mathbb{R}^n$.

We begin by considering a function of one variable. Let $x_0$ be an approximation to a zero of the function f. In a neighborhood of $x_0$, by Taylor's formula we have that
$$f(x) \approx f(x_0) + f'(x_0)(x - x_0) =: g(x). \tag{6.2}$$
Therefore, we may consider the zero of the affine linear function g as a new approximation to the zero of f and denote it by $x_1$. From the linear equation
$$f(x_0) + f'(x_0)(x_1 - x_0) = 0 \tag{6.3}$$
we immediately obtain
$$x_1 = x_0 - \frac{f(x_0)}{f'(x_0)}.$$
Geometrically, the affine linear function g describes the tangent line to the graph of the function f at the point $x_0$.

This consideration can be extended to the case of more than one variable. Given an approximation $x_0$ to a zero of f, by Taylor's formula we still have the approximation (6.2), where now, as in the previous section,
$$f'(x) = \left( \frac{\partial f_j}{\partial x_k}(x) \right)_{j,k = 1, \ldots, n}$$
denotes the Jacobian matrix of f. Again we obtain a new approximation $x_1$ for the solution of f(x) = 0 by solving the linearized equation (6.3), i.e., by
$$x_1 = x_0 - [f'(x_0)]^{-1} f(x_0).$$
Geometrically, the function g of (6.2) corresponds to the hyperplane tangent to f at the point $x_0$.

Iterating this procedure leads to Newton’s method, as described in the 
following definition. In the case of one variable, the geometric situation is 
shown in Figure 6.4. 


Definition 6.11 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be a continuously differentiable function such that the Jacobian matrix f'(x) is nonsingular for all $x \in D$. Then Newton's method for the solution of the equation
$$f(x) = 0$$
is given by the iteration scheme
$$x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots,$$
starting with some $x_0 \in D$.


FIGURE 6.4. Newton’s method 


We explicitly note that $x_{\nu+1}$ is obtained by solving the system of linear equations
$$f'(x_\nu)(x_\nu - x_{\nu+1}) = f(x_\nu)$$
for $x_\nu - x_{\nu+1}$; i.e., no matrix inversion is required.
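
The remark that no matrix inversion is required can be made concrete as follows. The Python sketch below is an illustration under stated assumptions: the helper names `solve` and `newton`, the stopping tolerance, and the test problem (the unit circle intersected with the line $x_1 = x_2$) are ours and do not come from the text. In each step the linear system $f'(x_\nu)\, d = f(x_\nu)$ is solved for $d = x_\nu - x_{\nu+1}$ by Gaussian elimination with partial pivoting.

```python
import math

def solve(A, b):
    """Solve A d = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    d = [0.0] * n
    for i in reversed(range(n)):                        # back substitution
        s = sum(M[i][j] * d[j] for j in range(i + 1, n))
        d[i] = (M[i][n] - s) / M[i][i]
    return d

def newton(F, J, x0, tol=1e-12, max_iter=50):
    """Newton iteration; J(x) returns the Jacobian as a list of rows."""
    x = list(x0)
    for _ in range(max_iter):
        d = solve(J(x), F(x))                # d = x_v - x_{v+1}
        x = [xi - di for xi, di in zip(x, d)]
        if max(abs(di) for di in d) <= tol:
            return x
    raise RuntimeError("Newton iteration did not converge")

# Hypothetical test problem: x1^2 + x2^2 = 1 and x1 = x2.
F = lambda x: [x[0] ** 2 + x[1] ** 2 - 1.0, x[0] - x[1]]
J = lambda x: [[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]]
root = newton(F, J, [1.0, 0.5])
print(root, math.sqrt(0.5))
```

For large systems one would replace the dense elimination by whatever linear solver suits the structure of the Jacobian; the point here is only that the inverse is never formed.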


Example 6.12 For the function
$$f(x) := a - \frac{1}{x},$$
where a > 0, the Newton iteration is given by
$$x_{\nu+1} := 2x_\nu - a x_\nu^2.$$
By Example 6.3 we have convergence for all $x_0 \in (0, 2/a)$. □


Example 6.13 For the function
$$f(x) := x^2 - a,$$
where a > 0, the Newton iteration is given by
$$x_{\nu+1} := \frac{1}{2}\left(x_\nu + \frac{a}{x_\nu}\right).$$
By Example 6.4 we have convergence for all $x_0 \in (0, \infty)$. □


Of course, we cannot expect that Newton's method will always converge. However, by the following analysis we can assure local convergence.


Theorem 6.14 Let $D \subset \mathbb{R}^n$ be open and convex and let $f : D \to \mathbb{R}^n$ be continuously differentiable. Assume that for some norm $\|\cdot\|$ on $\mathbb{R}^n$ and some $x_0 \in D$ the following conditions hold:

(a) f satisfies
$$\|f'(x) - f'(y)\| \leq \gamma\, \|x - y\|$$
for all $x, y \in D$ and some constant $\gamma > 0$.

(b) The Jacobian matrix f'(x) is nonsingular for all $x \in D$, and there exists a constant $\beta > 0$ such that
$$\|[f'(x)]^{-1}\| \leq \beta, \quad x \in D.$$

(c) For the constants
$$\alpha := \|[f'(x_0)]^{-1} f(x_0)\| \quad \text{and} \quad q := \alpha \beta \gamma$$
the inequality
$$q < \frac{1}{2}$$
is satisfied.

(d) For $r := 2\alpha$ the closed ball $B[x_0, r] := \{x : \|x - x_0\| \leq r\}$ is contained in D.

Then f has a unique zero $x^*$ in $B[x_0, r]$. Starting with $x_0$ the Newton iteration
$$x_{\nu+1} := x_\nu - [f'(x_\nu)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots, \tag{6.4}$$
is well-defined. The sequence $(x_\nu)$ converges to the zero $x^*$ of f, and we have the error estimate
$$\|x_\nu - x^*\| \leq 2\alpha\, q^{2^\nu - 1}, \quad \nu = 0, 1, \ldots.$$

Proof. 1. Let $x, y, z \in D$. From the proof of Theorem 6.7 we know that
$$f(y) - f(x) = \int_0^1 f'[\lambda x + (1 - \lambda) y]\, (y - x)\, d\lambda.$$
Hence
$$f(y) - f(x) - f'(z)(y - x) = \int_0^1 \{ f'[\lambda x + (1 - \lambda) y] - f'(z) \}\, (y - x)\, d\lambda,$$
and estimating with the aid of (6.1) and condition (a) we find that
$$\|f(y) - f(x) - f'(z)(y - x)\| \leq \gamma\, \|y - x\| \int_0^1 \|\lambda (x - z) + (1 - \lambda)(y - z)\|\, d\lambda \leq \frac{\gamma}{2}\, \|y - x\|\, \{ \|x - z\| + \|y - z\| \}.$$
Choosing z = x shows that
$$\|f(y) - f(x) - f'(x)(y - x)\| \leq \frac{\gamma}{2}\, \|y - x\|^2 \tag{6.5}$$
for all $x, y \in D$, and choosing $z = x_0$ yields
$$\|f(y) - f(x) - f'(x_0)(y - x)\| \leq \gamma r\, \|y - x\| \tag{6.6}$$
for all $x, y \in B[x_0, r]$.
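2. We proceed by proving through induction that
$$\|x_\nu - x_0\| \leq r \quad \text{and} \quad \|x_\nu - x_{\nu-1}\| \leq \alpha\, q^{2^{\nu-1} - 1}, \quad \nu = 1, 2, \ldots. \tag{6.7}$$
This is valid for $\nu = 1$, since
$$\|x_1 - x_0\| = \|[f'(x_0)]^{-1} f(x_0)\| = \alpha = \frac{r}{2} < r$$
as a consequence of conditions (c) and (d). Assume that the inequalities (6.7) are proven up to some $\nu \geq 1$. Then by condition (b) and since $x_\nu \in B[x_0, r] \subset D$, the element $x_{\nu+1}$ is well-defined. With the aid of condition (b), the definition (6.4) applied to $x_\nu$, the estimate (6.5), the induction assumption, and the definition of q we can estimate
$$\begin{aligned}
\|x_{\nu+1} - x_\nu\| &= \|[f'(x_\nu)]^{-1} f(x_\nu)\| \leq \beta\, \|f(x_\nu)\| \\
&= \beta\, \|f(x_\nu) - f(x_{\nu-1}) - f'(x_{\nu-1})(x_\nu - x_{\nu-1})\| \\
&\leq \frac{\beta\gamma}{2}\, \|x_\nu - x_{\nu-1}\|^2 \leq \frac{\beta\gamma}{2}\, \alpha^2 q^{2^\nu - 2} = \frac{\alpha}{2}\, q^{2^\nu - 1} < \alpha\, q^{2^\nu - 1}.
\end{aligned}$$
From this, with the help of the triangle inequality, the induction assumption, and condition (c), we obtain that
$$\begin{aligned}
\|x_{\nu+1} - x_0\| &\leq \|x_{\nu+1} - x_\nu\| + \cdots + \|x_1 - x_0\| \\
&\leq \alpha \left( 1 + q + q^3 + q^7 + \cdots + q^{2^\nu - 1} \right) \leq \frac{\alpha}{1 - q} < 2\alpha = r;
\end{aligned}$$
i.e., the inequalities (6.7) also hold for $\nu + 1$.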
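3. For $\mu > 0$, using q < 1/2, we now can estimate
$$\begin{aligned}
\|x_\nu - x_{\nu+\mu}\| &\leq \|x_\nu - x_{\nu+1}\| + \cdots + \|x_{\nu+\mu-1} - x_{\nu+\mu}\| \\
&\leq \alpha \left( q^{2^\nu - 1} + q^{2^{\nu+1} - 1} + \cdots + q^{2^{\nu+\mu-1} - 1} \right) \\
&\leq \alpha\, q^{2^\nu - 1} \left( 1 + q^{2^\nu} + \left( q^{2^\nu} \right)^2 + \cdots \right) \leq \frac{\alpha\, q^{2^\nu - 1}}{1 - q^{2^\nu}} \leq 2\alpha\, q^{2^\nu - 1}.
\end{aligned} \tag{6.8}$$
From this we observe that $(x_\nu)$ is a Cauchy sequence, since q < 1/2, and by Theorem 3.39 the limit
$$x^* = \lim_{\nu \to \infty} x_\nu$$
exists. Passing to the limit $\nu \to \infty$ in (6.7) we obtain $\|x^* - x_0\| \leq r$, i.e., $x^* \in B[x_0, r]$, and passing to the limit $\mu \to \infty$ in (6.8) the error estimate of the theorem follows.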

4. We now show that the limit $x^*$ is a zero of the function f. With the aid of (6.4) and condition (a) we can estimate
$$\begin{aligned}
\|f(x_\nu)\| &= \|f'(x_\nu)(x_{\nu+1} - x_\nu)\| \\
&\leq \|f'(x_\nu) - f'(x_0) + f'(x_0)\|\, \|x_{\nu+1} - x_\nu\| \\
&\leq \{ \gamma\, \|x_\nu - x_0\| + \|f'(x_0)\| \}\, \|x_{\nu+1} - x_\nu\| \to 0, \quad \nu \to \infty.
\end{aligned}$$
Hence $f(x_\nu) \to 0$, $\nu \to \infty$, and the continuity of f implies that indeed $f(x^*) = 0$.
5. We conclude the proof by showing that $x^*$ is the only zero of f in the ball $B[x_0, r]$. For this we consider the function $g : B[x_0, r] \to \mathbb{R}^n$ defined by
$$g(x) := x - [f'(x_0)]^{-1} f(x).$$
From conditions (b) and (c) and the inequality (6.6), by writing
$$g(x) - g(y) = [f'(x_0)]^{-1} \{ f(y) - f(x) - f'(x_0)(y - x) \}$$
we deduce that
$$\|g(x) - g(y)\| \leq \beta \gamma r\, \|y - x\| = 2q\, \|y - x\|$$
for all $x, y \in B[x_0, r]$; i.e., g is a contraction. Therefore, by Theorem 3.44 the function g has at most one fixed point in $B[x_0, r]$. Now uniqueness of the zero of f in $B[x_0, r]$ follows from the equivalence of the equations g(x) = x and f(x) = 0. □


Our main application of Theorem 6.14 consists in deriving the following 
local convergence result for Newton’s method. 


Corollary 6.15 Let $D \subset \mathbb{R}^n$ be open and let $f : D \to \mathbb{R}^n$ be twice continuously differentiable, and assume that $x^*$ is a zero of f such that the Jacobian $f'(x^*)$ is nonsingular. Then Newton's method is locally convergent; i.e., there exists a neighborhood B of the zero $x^*$ such that the Newton iterations converge to $x^*$ for all $x_0 \in B$.


Proof. Since f is twice continuously differentiable, by the mean value Theorem 6.7 applied to the components of f' there exists $\gamma > 0$ such that
$$\|f'(x) - f'(y)\| \leq \gamma\, \|x - y\|$$
for all x, y in some closed ball $B[x^*, \rho]$ centered at $x^*$. We write
$$f'(x) = f'(x^*) \{ I + [f'(x^*)]^{-1} [f'(x) - f'(x^*)] \}$$
and deduce from the above estimate and Theorem 3.48 that the radius $\rho$ of $B[x^*, \rho]$ can be chosen such that f'(x) is nonsingular on $B[x^*, \rho]$ and $\|[f'(x)]^{-1}\| \leq \beta$ for all $x \in B[x^*, \rho]$ and some constant $\beta > 0$.

Since f is continuous, $f(x^*) = 0$ implies that there exists $\delta < \rho/2$ such that
$$\|f(x_0)\| < \min\left\{ \frac{1}{2\beta^2\gamma},\ \frac{\rho}{4\beta} \right\}$$
for all $\|x_0 - x^*\| < \delta$. Then, after setting $\alpha := \|[f'(x_0)]^{-1} f(x_0)\|$ we have the inequalities
$$\alpha\beta\gamma \leq \|f(x_0)\|\, \beta^2 \gamma < \frac{1}{2}$$
and
$$2\alpha \leq 2\beta\, \|f(x_0)\| < \frac{\rho}{2}.$$
Hence for the open and convex ball $B(x^*, \rho)$ and for each $x_0$ with $\|x_0 - x^*\| < \delta$ the assumptions of Theorem 6.14 are satisfied. □


Corollary 6.16 Let $f : (a, b) \to \mathbb{R}$ be twice continuously differentiable and assume that $x^*$ is a simple zero of f. Then Newton's method is locally convergent.


Proof. For simple zeros we have $f'(x^*) \neq 0$. □

Example 6.17 For the function $f(x) := x - \cos x$ the Newton iteration reads
$$x_{\nu+1} := x_\nu - \frac{x_\nu - \cos x_\nu}{1 + \sin x_\nu}$$
and leads to the numerical values of Table 6.2. □


TABLE 6.2. Newton iterations for Example 6.17

 ν | x_ν
 0 | 1.00000000
 1 | 0.75036387
 2 | 0.73911289
 3 | 0.73908513
 4 | 0.73908513


Example 6.18 For the function $f(x) := x - e^{-x}$ the Newton iteration reads
$$x_{\nu+1} := x_\nu - \frac{x_\nu - e^{-x_\nu}}{1 + e^{-x_\nu}}$$
and leads to the numerical values of Table 6.3. □


TABLE 6.3. Newton iterations for Example 6.18

 ν | x_ν
 0 | 1.00000000
 1 | 0.53788284
 2 | 0.56698699
 3 | 0.56714329
 4 | 0.56714329

In both examples we observe that the speed of convergence is consider- 
ably improved as compared with the simple successive approximations of 
Examples 6.5 and 6.6. For a general description of this more rapid conver- 
gence of Newton’s method we need the following definition. 


Definition 6.19 A convergent sequence $(x_\nu)$ from a normed space with limit x is said to be convergent of order $p \geq 1$ if there exists a constant C > 0 such that
$$\|x_{\nu+1} - x\| \leq C\, \|x_\nu - x\|^p, \quad \nu = 1, 2, \ldots.$$


Convergence of order one or two is also called linear or quadratic conver- 
gence, respectively. We note that the convergence in Banach’s fixed point 
Theorem 3.45 is, in general, linear. 
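
Definition 6.19 can be probed numerically by monitoring the quotients $|x_{\nu+1} - x| / |x_\nu - x|^p$. The Python sketch below is only an experiment under our own choices (helper name `error_ratios`, starting points, and reference limits taken from the known solutions): for the successive approximations of Example 6.5 the quotients with p = 1 settle near sin x, about 0.67, while for the square root iteration of Example 6.4 the quotients with p = 2 stay bounded.

```python
import math

def error_ratios(step, x0, limit, p, count):
    """Return the quotients |x_{v+1} - limit| / |x_v - limit|^p."""
    x, ratios = x0, []
    for _ in range(count):
        x_next = step(x)
        ratios.append(abs(x_next - limit) / abs(x - limit) ** p)
        x = x_next
    return ratios

# x_{v+1} = cos x_v (Example 6.5), tested for order p = 1.
fixed = 0.7390851332151607          # reference value of the fixed point
linear = error_ratios(math.cos, 1.0, fixed, 1, 20)

# x_{v+1} = (x_v + 2/x_v)/2 (Example 6.4 with a = 2), tested for p = 2.
quadratic = error_ratios(lambda x: 0.5 * (x + 2.0 / x), 3.0,
                         math.sqrt(2.0), 2, 4)

print(linear[-1], quadratic)
```

Only a few steps are used in the quadratic case, because the error reaches rounding level very quickly and the quotients then lose their meaning.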


Theorem 6.20 Under the assumptions of Theorem 6.14 Newton’s method 
converges quadratically. 


Proof. Using condition (b) of Theorem 6.14 and the inequality (6.5) we can estimate
$$\begin{aligned}
\|x^* - x_{\nu+1}\| &= \|x^* - x_\nu + [f'(x_\nu)]^{-1} f(x_\nu)\| \\
&\leq \|[f'(x_\nu)]^{-1}\|\, \|f(x^*) - f(x_\nu) - f'(x_\nu)(x^* - x_\nu)\| \\
&\leq \frac{\beta\gamma}{2}\, \|x^* - x_\nu\|^2,
\end{aligned}$$
since $f(x^*) = 0$. □

Roughly speaking, the quadratic convergence of Newton's method means that the number of correct digits in the numerical approximation is doubled in each iteration step, as observed in Examples 6.3, 6.4, 6.17, and 6.18. Although by this property Newton's method is very attractive, it has to be observed that one step of the Newton iteration for nonlinear systems can be very costly both through the need for evaluating the entries of the Jacobian $f'(x_\nu)$ and through the cost of solving the linear system to arrive at the new iterate $x_{\nu+1}$. Therefore, a great variety of modifications of Newton's method have been developed that mitigate, in particular, the first difficulty. These modified Newton methods, in general, are of the form
$$x_{\nu+1} := x_\nu - A_\nu f(x_\nu), \quad \nu = 0, 1, \ldots;$$
i.e., the inverse $[f'(x_\nu)]^{-1}$ of the Jacobian is replaced by some approximating matrix $A_\nu$. Here we will only briefly mention two classical and simple possibilities for avoiding the evaluation of the Jacobian at each iteration step.

In the simplified, or frozen, Newton method, for all steps the matrix $A_\nu$ is kept the same and chosen as the inverse of the Jacobian for the starting point; i.e., the iteration scheme is
$$x_{\nu+1} := x_\nu - [f'(x_0)]^{-1} f(x_\nu), \quad \nu = 0, 1, \ldots.$$
Geometrically, in the one-dimensional case this means that the tangent line of f at $x_\nu$ is replaced by the parallel to the tangent line of f at $x_0$ passing through $(x_\nu, f(x_\nu))$.


Theorem 6.21 Under the assumptions of Theorem 6.14 the simplified Newton method converges linearly to the unique zero of f in $B[x_0, r]$.


Proof. Recall that the function
$$g(x) := x - [f'(x_0)]^{-1} f(x)$$
defined in the proof of Theorem 6.14 is a contraction. We show that g maps $B[x_0, r]$ into itself. For this we write
$$x_0 - g(x) = [f'(x_0)]^{-1} \{ f(x) - f(x_0) - f'(x_0)(x - x_0) + f(x_0) \}.$$
Then estimating with the help of conditions (b), (c) and (d) and the inequality (6.5) we obtain
$$\|g(x) - x_0\| \leq \frac{\beta\gamma}{2}\, \|x - x_0\|^2 + \alpha \leq 2\alpha^2 \beta\gamma + \alpha = (2q + 1)\, \alpha < 2\alpha = r$$
for all x with $\|x - x_0\| \leq r$. Now the statement of the theorem follows from the Banach fixed point Theorem 3.46. □

In the secant method for a function of one variable the derivative $f'(x_\nu)$ is approximated by the difference quotient and the corresponding iterative scheme is given by
$$x_{\nu+1} := x_\nu - \frac{x_\nu - x_{\nu-1}}{f(x_\nu) - f(x_{\nu-1})}\, f(x_\nu), \quad \nu = 1, 2, \ldots. \tag{6.9}$$
Geometrically, this means that the tangent line at $x_\nu$ is replaced by the secant line through the two points $x_\nu$ and $x_{\nu-1}$. Obviously, this method needs two initial elements $x_0$ and $x_1$. Generalizations to functions in $\mathbb{R}^n$ are possible (see [47]).
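
Both derivative-saving variants are easy to try in one dimension. The Python sketch below is an illustration under our own choices: the function names, tolerances, starting values, and the test function $f(x) = x - \cos x$ from Example 6.17 are assumptions for the sake of the example. The simplified Newton method evaluates the derivative once at $x_0$; the secant method follows (6.9) and needs no derivative at all.

```python
import math

def simplified_newton(f, df, x0, steps):
    """x_{v+1} = x_v - f(x_v)/f'(x_0): the derivative is frozen at x_0."""
    slope = df(x0)
    x = x0
    for _ in range(steps):
        x = x - f(x) / slope
    return x

def secant(f, x0, x1, tol=1e-12, max_iter=100):
    """Iteration (6.9); stops when successive iterates agree within tol."""
    for _ in range(max_iter):
        f0, f1 = f(x0), f(x1)
        if f1 == f0:                 # difference quotient undefined
            return x1
        x0, x1 = x1, x1 - (x1 - x0) / (f1 - f0) * f1
        if abs(x1 - x0) <= tol:
            return x1
    raise RuntimeError("secant iteration did not converge")

f = lambda x: x - math.cos(x)
df = lambda x: 1.0 + math.sin(x)
print(simplified_newton(f, df, 1.0, 30), secant(f, 1.0, 0.5))
```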

In general, for the simplified Newton method and for the secant method we can expect only linear convergence. The idea underlying the more sophisticated modified Newton methods is to choose the approximating matrices $A_\nu$ in a manner leading to an improvement over linear convergence without requiring the computational costs of the full Newton method. In the so-called rank one methods suggested by Broyden in 1965, in each iteration step the matrix $A_\nu$ is updated from the previous matrix $A_{\nu-1}$ by adding only a matrix of rank one such that the resulting iteration scheme is superlinearly convergent. Roughly speaking, the latter means that for the sequence $x_\nu \to x$, $\nu \to \infty$, we have that
$$\|x_{\nu+1} - x\| \leq C_\nu\, \|x_\nu - x\|, \quad \nu = 1, 2, \ldots,$$
such that $C_\nu \to 0$, $\nu \to \infty$. For details we refer to the literature (see [20, 47]).


6.3 Zeros of Polynomials 


In this section we shall apply Newton’s method to the computation of the 
zeros of polynomials. Finding the zeros of polynomials is a classical problem 
in mathematics and numerical analysis despite the fact that it very seldom 
occurs in applications. We first observe that Newton’s method also works 
for a complex function of a complex variable, allowing the computation of 
complex zeros. 

Consider the polynomial
$$p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n$$
with real or complex coefficients $a_0, a_1, \ldots, a_n$. For the application of Newton's method, in each iteration step we need to compute the values of p and p' at the point $x_\nu$. This can be effectively done by the Horner scheme. This is based on writing the polynomial in the form of nested multiplications
$$p(z) = (\cdots((a_0 z + a_1) z + a_2) z + \cdots + a_{n-1}) z + a_n,$$
which suggests the recursion
$$b_m = b_{m-1} z + a_m, \quad m = 1, \ldots, n, \tag{6.10}$$
starting with $b_0 = a_0$. Performing these n multiplications and additions, we arrive at the value of the polynomial $p(z) = b_n$.

For the polynomial
$$p_1(x) := b_0 x^{n-1} + b_1 x^{n-2} + b_2 x^{n-3} + \cdots + b_{n-2} x + b_{n-1},$$
using (6.10) we compute
$$p_1(x)(x - z) + b_n = \sum_{m=0}^{n-1} b_m x^{n-1-m} (x - z) + b_n = \sum_{m=0}^{n} a_m x^{n-m} = p(x).$$
This implies that for a zero z the Horner scheme provides the coefficients of the polynomial obtained by dividing p by the linear factor x - z. In addition, we have that
$$p'(x) = p_1'(x)(x - z) + p_1(x), \tag{6.11}$$
and in particular,
$$p'(z) = p_1(z).$$
Hence, applying the Horner recursion to the polynomial $p_1$ yields the value of the derivative p'(z). By repeating this process recursively, we can determine all the derivatives of p at the point z, since by induction, from (6.11) we obtain that
$$p^{(k)}(x) = p_1^{(k)}(x)(x - z) + k\, p_1^{(k-1)}(x),$$
whence
$$p^{(k)}(z) = k\, p_1^{(k-1)}(z)$$
follows for $k = 1, \ldots, n$. Therefore, defining recursively polynomials $p_k$ of degree n - k by applying the Horner scheme to the preceding polynomial $p_{k-1}$ leads to
$$p^{(k)}(z) = k!\, p_k(z), \quad k = 1, \ldots, n.$$
We can summarize this in the following theorem.


Theorem 6.22 Let
$$p(x) = a_0 x^n + a_1 x^{n-1} + a_2 x^{n-2} + \cdots + a_{n-1} x + a_n$$
be a polynomial of degree n. For $z \in \mathbb{C}$ the complete Horner scheme


contains the derivatives
$$b^{(k)}_{n-k} = \frac{1}{k!}\, p^{(k)}(z), \quad k = 0, \ldots, n,$$
of the polynomial p at the point z. The scheme is recursively defined by $b^{(-1)}_m = a_m$, $m = 0, \ldots, n$, and
$$b^{(k)}_0 = b^{(k-1)}_0, \qquad b^{(k)}_m = z\, b^{(k)}_{m-1} + b^{(k-1)}_m, \quad m = 1, \ldots, n - k,$$
for $k = 0, \ldots, n$.

Example 6.23 For the polynomial $p(x) := x^3 - x^2 + 3x - 5$ the Horner scheme

for z = 2 leads to p(2) = 5, p'(2) = 11, p''(2) = 10, p'''(2) = 6. □
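
The complete Horner scheme is short enough to write out in code. The following Python sketch (the function name `complete_horner` is ours) applies the recursion (6.10) repeatedly, dropping the last entry after each pass, and multiplies by k! as in $p^{(k)}(z) = k!\, p_k(z)$. Applied to the polynomial of Example 6.23 at z = 2, the successive rows are 1, 1, 5, 5, then 1, 3, 11, then 1, 5, and finally 1, which gives the derivative values quoted in the example.

```python
import math

def complete_horner(coeffs, z):
    """coeffs = [a_0, ..., a_n] for p(x) = a_0 x^n + ... + a_n.
    Returns [p(z), p'(z), ..., p^(n)(z)]."""
    row = list(coeffs)
    derivs = []
    for k in range(len(coeffs)):
        # One Horner pass (6.10) over the current coefficients.
        for m in range(1, len(row)):
            row[m] += z * row[m - 1]
        derivs.append(math.factorial(k) * row[-1])   # p^(k)(z) = k! p_k(z)
        row = row[:-1]                               # coefficients of p_{k+1}
    return derivs

print(complete_horner([1, -1, 3, -5], 2))   # Example 6.23
```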


We continue by outlining how to compute all the zeros of a polynomial p of degree n with real coefficients. We first assume that p has only simple real zeros and proceed as follows:

1. Either from analytic considerations or by plotting a graph of the polynomial we obtain a rough estimate of the location of the zeros $z_n < z_{n-1} < \cdots < z_2 < z_1$.

2. Starting with some $x_0 > z_1$, by Newton iteration we compute the largest zero $z_1$. The global convergence of Newton's method in this case follows from monotonicity arguments (see Problem 6.13).

3. By the Horner scheme we divide p by the linear factor $x - z_1$ and carry out step two for the reduced polynomial to compute $z_2$. Repeating this procedure, we successively obtain approximations for all zeros.

4. In order to improve the accuracy, for all zeros Newton's method is applied to the full polynomial p with the starting points of the iteration given by the approximations obtained in step three.
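
Steps two to four above can be sketched in a few lines of Python. The code is an illustration under stated assumptions: the function names, the tolerance, and the test polynomial $(x-1)(x-2)(x-3)$ are ours, the polynomial is assumed to have only simple real zeros, and the starting value is assumed to lie to the right of the largest zero. Each computed zero is reused as the starting point for the next one, since it lies to the right of all remaining zeros.

```python
def horner(coeffs, z):
    """Return (p(z), p'(z), coefficients of the quotient p_1)."""
    b = [coeffs[0]]
    for a in coeffs[1:]:
        b.append(b[-1] * z + a)        # recursion (6.10)
    quotient = b[:-1]
    dp = 0.0
    for c in quotient:
        dp = dp * z + c                # p'(z) = p_1(z) by (6.11)
    return b[-1], dp, quotient

def newton_poly(coeffs, x, tol=1e-13, max_iter=200):
    """Newton iteration for a polynomial, evaluated by the Horner scheme."""
    for _ in range(max_iter):
        p, dp, _ = horner(coeffs, x)
        step = p / dp
        x -= step
        if abs(step) <= tol * max(1.0, abs(x)):
            return x
    raise RuntimeError("Newton iteration did not converge")

def real_simple_zeros(coeffs, x_start):
    """Steps 2 to 4; x_start must lie to the right of the largest zero."""
    zeros, work = [], list(coeffs)
    while len(work) > 1:
        z = newton_poly(work, x_start)          # step 2 on the reduced polynomial
        zeros.append(newton_poly(coeffs, z))    # step 4: polish on the full p
        work = horner(work, z)[2]               # step 3: divide by (x - z)
        x_start = z
    return zeros

print(real_simple_zeros([1.0, -6.0, 11.0, -6.0], 10.0))
```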
Now we consider the case of multiple real zeros. If $z$ is a zero of order $m$,
then we can write

\[ p(x) = (x - z)^m q(x), \tag{6.12} \]

where the polynomial $q$ of degree $n - m$ satisfies $q(z) \neq 0$. To see the
effect of (6.12) on Newton’s method we consider it as a fixed-point iteration
$x_{\nu+1} := g(x_\nu)$ with $g$ defined by

\[ g(x) := x - \frac{p(x)}{p'(x)}. \]

Using (6.12), by elementary differentiation we obtain

\[ g'(z) = 1 - \frac{1}{m}. \]


Therefore, by Theorem 6.2, at a multiple zero Newton’s method is locally 
convergent. Obviously, the convergence at a multiple zero is only linear. 
However, one can modify Newton’s method for multiple zeros such that 
the quadratic convergence is preserved (see Problem 6.14). 
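
The differentiation behind the value of $g'(z)$ can be spelled out as follows. This is a sketch of the standard argument, using only (6.12) and the definition of $g$ as the Newton iteration function $g(x) = x - p(x)/p'(x)$.

```latex
\begin{align*}
p'(x) &= (x - z)^{m-1} \bigl[ m\, q(x) + (x - z)\, q'(x) \bigr], \\
g(x)  &= x - \frac{(x - z)\, q(x)}{m\, q(x) + (x - z)\, q'(x)}, \\
g'(z) &= 1 - \frac{q(z)}{m\, q(z)} = 1 - \frac{1}{m}.
\end{align*}
```

For $m \geq 2$ we have $0 < g'(z) < 1$, so that close to $z$ the error is multiplied by approximately $1 - 1/m$ in each step, which is linear convergence.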

For finding complex zeros, in principle one can apply Newton’s method 
in C. For this one has to keep in mind that for polynomials with real coef- 
ficients, the starting values need to be complex, since otherwise Newton’s 
method would produce only real approximations. For the conjugate com- 
plex zeros of a polynomial with real coefficients Bairstow’s method avoids 
working in the complex plane by using the fact that for two conjugate zeros,
the product of the linear factors $(x - z)(x - \bar{z})$ is a polynomial of degree
two with real coefficients. The basic idea is to write the polynomial $p$ of
degree $n$ in the form

\[ p(x) = (x^2 - ux - v)\, q(x) + a(x - u) + b, \]

where $q$ is a polynomial of degree $n - 2$, and $a$ and $b$ are constants depending
on $u, v \in \mathbb{R}$. The factor $x^2 - ux - v$ corresponds to two conjugate complex
zeros of $p$ if the pair $u, v$ solves the nonlinear system $a(u,v) = 0$, $b(u,v) = 0$.
The latter can be solved by Newton’s method, and once the solution $u, v$ is
known, the two zeros of $p$ are obtained by solving the quadratic equation
$x^2 - ux - v = 0$.

We conclude this section with some consideration of the question of sta- 
bility. In particular, we show that the zeros of polynomials can be quite 
sensitive to small changes in the coefficients even if all the zeros are simple 
and well separated from each other. 

Let $p$ and $q$ be polynomials of degree $n$ and assume that $z_0$ is a simple
zero of $p$. Consider the perturbed polynomial

\[ p(\cdot\,, \varepsilon) := p + \varepsilon q, \]

where $\varepsilon$ is small. Using the theory of functions of a complex variable, it can
be shown that in a neighborhood of $\varepsilon = 0$ the zero $z(\varepsilon)$ depends analytically
on the parameter $\varepsilon$. The derivative $z'$ can be obtained by differentiating
$p[z(\varepsilon), \varepsilon] = 0$ with respect to $\varepsilon$. This yields

\[ \{ p'[z(\varepsilon)] + \varepsilon q'[z(\varepsilon)] \}\, z'(\varepsilon) + q[z(\varepsilon)] = 0, \]

and setting $\varepsilon = 0$, it follows that

\[ z'(0) = -\frac{q(z_0)}{p'(z_0)}. \]

Hence, for small $\varepsilon$ we have that

\[ z(\varepsilon) \approx z_0 - \varepsilon\, \frac{q(z_0)}{p'(z_0)}. \tag{6.13} \]


Example 6.24 The polynomial

\[ p(x) := (x - 1)(x - 2) \cdots (x - 10) = x^{10} - 55x^9 + \cdots + 10! \]

has the zeros $1, 2, \ldots, 10$, which are well separated from each other. We
perturb the coefficient of $x^9$ by choosing $q(x) := 55x^9$. Since $p'(10) = 9!$,
by (6.13), the zero $z = 10$ of the polynomial $p$ is perturbed into

\[ 10 - \varepsilon\, \frac{55 \cdot 10^9}{9!} \approx 10 - 1.5\, \varepsilon \cdot 10^5. \]

This illustrates that finding the zeros of $p$ is an ill-conditioned problem and
that a reliable approximation of the zeros is impossible. □
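
The amplification factor in this example is easily checked numerically. The following lines are a small sketch with names of our own choosing; they evaluate $p'(10)$ from the product representation of $p$ and then the first-order sensitivity $q(z_0)/p'(z_0)$ from (6.13).

```python
from math import factorial, prod

def derivative_at_simple_zero(zeros, z0):
    """p'(z0) for p(x) = prod (x - z) when z0 is one of the simple zeros."""
    return prod(z0 - z for z in zeros if z != z0)

zeros = range(1, 11)
z0 = 10
p_prime = derivative_at_simple_zero(zeros, z0)
sensitivity = 55 * z0**9 / p_prime        # q(z0) / p'(z0) with q(x) = 55 x^9

print(p_prime == factorial(9))            # p'(10) = 9!
print(sensitivity)                        # roughly 1.5e5
```

A relative change $\varepsilon$ in the single coefficient of $x^9$ therefore moves the zero $z = 10$ by about $1.5 \cdot 10^5\, \varepsilon$ to first order.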


6.4 Least Squares Problems 


Quite often the problem of solving a system of nonlinear equations may 
be replaced by an equivalent problem of minimizing a function and vice 
versa. We illustrate this by introducing the Levenberg—Marquardt method 
as one of the most effective procedures for solving nonlinear least squares 
problems. 

Let $g : \mathbb{R}^n \to \mathbb{R}$ be a twice continuously differentiable function and
consider the problem of minimizing $g$. Let $x_0$ be an approximation for a
local minimum of $g$. In a neighborhood of $x_0$, by Taylor’s formula we may
approximate

\[ g(x) \approx g(x_0) + (x - x_0)^T \operatorname{grad} g(x_0) + \frac{1}{2}\, (x - x_0)^T g''(x_0)(x - x_0), \tag{6.14} \]

where

\[ g'' = \left( \frac{\partial^2 g}{\partial x_j\, \partial x_k} \right)_{j,k = 1, \ldots, n} \]

denotes the Hessian matrix of $g$. Minimizing the quadratic function on the
right-hand side of (6.14) yields

\[ x_1 = x_0 - [g''(x_0)]^{-1} \operatorname{grad} g(x_0) \tag{6.15} \]


as a new approximation for the minimum of g. We observe that (6.15) ob- 
viously coincides with one Newton step for solving the necessary condition 
grad g(x) = 0 for a local minimum. 

However, if (6.14) is only a very poor approximation to g, then we expect 
the Newton step (6.15) not to be very effective. In this case it is more 
appropriate to use a so-called method of steepest descent; i.e., choose 


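\[ x_1 = x_0 - \lambda M \operatorname{grad} g(x_0) \tag{6.16} \]

as a new approximation. Here $M$ is a positive definite matrix, and the
step size $\lambda > 0$ is chosen such that $g(x_1) < g(x_0)$ is satisfied. This can be
achieved, since by Taylor’s formula we have that

\[ g(x_0 - \lambda M \operatorname{grad} g(x_0)) \approx g(x_0) - \lambda\, [\operatorname{grad} g(x_0)]^T M \operatorname{grad} g(x_0) \]

and $M$ is assumed to be positive definite, so that the right-hand side is
smaller than $g(x_0)$ whenever $\operatorname{grad} g(x_0) \neq 0$.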
After introducing the vector $y \in \mathbb{R}^n$ and the $n \times n$ matrix $A$ by

\[ y_j(x) := -\frac{\partial g}{\partial x_j}(x), \quad a_{jk}(x) := \frac{\partial^2 g}{\partial x_j\, \partial x_k}(x), \tag{6.17} \]

we can rewrite the Newton iteration (6.15) as the linear system

\[ A(x_0)(x_1 - x_0) = y, \tag{6.18} \]

which we have to solve for the difference $x_1 - x_0$. Similarly, one step of the
steepest descent (6.16) can be transformed into

\[ x_1 - x_0 = \lambda M y. \tag{6.19} \]


Now recall the least squares problem of Example 2.4. In a slight reformulation,
this problem consists in minimizing the function

\[ g(x) = \sum_{i=1}^m [f_i(x) - u_i]^2 \]

over some domain $D$, where the $f_i : D \to \mathbb{R}$ are given functions and the
$u_i \in \mathbb{R}$ are given constants for $i = 1, \ldots, m$. We compute the derivatives

\[ \frac{\partial g}{\partial x_j}(x) = 2 \sum_{i=1}^m [f_i(x) - u_i]\, \frac{\partial f_i}{\partial x_j}(x) \]

and

\[ \frac{\partial^2 g}{\partial x_j\, \partial x_k}(x) = 2 \sum_{i=1}^m \left\{ \frac{\partial f_i}{\partial x_j}(x)\, \frac{\partial f_i}{\partial x_k}(x) + [f_i(x) - u_i]\, \frac{\partial^2 f_i}{\partial x_j\, \partial x_k}(x) \right\}. \]

In this case the matrix $(a_{jk})$ contains second derivatives of the functions
$f_i$. However, since these derivatives are multiplied by the factor $[f_i(x) - u_i]$,
which will become small by minimizing $g$, it is justified to neglect this term.
Note that if Newton’s method converges, it always will converge to a zero,
even if we do not use the exact Jacobian for the computation, provided that
the approximate Jacobian at the limit is nonsingular. Hence, we simplify
and replace (6.17) by

\[ a_{jk}(x) := 2 \sum_{i=1}^m \frac{\partial f_i}{\partial x_j}(x)\, \frac{\partial f_i}{\partial x_k}(x) \tag{6.20} \]

and note that $a_{jj}(x) > 0$.
Now the Levenberg-Marquardt method combines (6.18) and (6.19) by
first introducing the $n \times n$ matrix $\tilde{A} = (\tilde{a}_{jk})$ with entries

\[ \tilde{a}_{jj} := (1 + \gamma)\, a_{jj}, \quad \tilde{a}_{jk} := a_{jk}, \quad j \neq k, \]

where $\gamma$ is some positive parameter, and then replacing (6.18) and (6.19)
by

\[ \tilde{A}(x_0)(x_1 - x_0) = y. \tag{6.21} \]

Obviously, for large $\gamma$ the matrix $\tilde{A}$ will become diagonally dominant, and
(6.21) will get close to the steepest descent, with

\[ M = \operatorname{diag}\left( \frac{1}{a_{11}}, \ldots, \frac{1}{a_{nn}} \right) \]

and $\lambda = 1/\gamma$. For $\gamma \to 0$, on the other hand, (6.21) will turn into the Newton
step (6.18). This ability to gradually vary between Newton’s method and
the steepest descent method is one of the basic features of the
Levenberg-Marquardt method, which we describe as follows:

1. Choose an initial guess $x_0$, some moderately sized value for $\gamma$, and a
factor $\alpha$, say $\gamma = 0.001$ and $\alpha = 10$.
2. Solve the linear system (6.21) to obtain $x_1$.
3. If $g(x_1) \geq g(x_0)$, then reject $x_1$ as a new approximation, replace $\gamma$ by
$\alpha\gamma$, and go back and repeat step two.
4. If $g(x_1) < g(x_0)$, then accept $x_1$ as a new approximation, replace $x_0$
by $x_1$ and $\gamma$ by $\gamma/\alpha$, and go back to step two.
5. Terminate when the difference $|g(x_1) - g(x_0)|$ is smaller than some
given tolerance.

For a detailed analysis of this method we refer to [44]. For a study of 
nonlinear optimization methods and their relation to nonlinear systems we 
refer to [20]. 
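
The five steps of the Levenberg-Marquardt method can be sketched in plain Python as follows. This is a minimal illustration under several assumptions of our own: the names `solve` and `levenberg_marquardt`, the interface (a function `f` returning the values $f_i(x)$ and a function `jac` returning the rows of first derivatives), the iteration limit, and the test problem are not taken from the text. The matrix is formed according to (6.20), the system (6.21) is solved by Gaussian elimination, and no safeguard against a singular matrix is included.

```python
from math import exp

def solve(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[piv] = m[piv], m[c]
        for r in range(c + 1, n):
            factor = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= factor * m[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def levenberg_marquardt(f, jac, u, x0, gamma=1e-3, alpha=10.0, tol=1e-12, max_iter=200):
    """Minimize g(x) = sum_i (f_i(x) - u_i)^2 following steps 1 to 5."""
    def g(x):
        return sum((fi - ui) ** 2 for fi, ui in zip(f(x), u))

    x, gx, n = list(x0), g(x0), len(x0)
    for _ in range(max_iter):
        res = [fi - ui for fi, ui in zip(f(x), u)]
        J = jac(x)                                    # J[i][j] = df_i/dx_j
        rows = range(len(res))
        y = [-2.0 * sum(J[i][j] * res[i] for i in rows) for j in range(n)]
        a = [[2.0 * sum(J[i][j] * J[i][k] for i in rows) * (1.0 + gamma if j == k else 1.0)
              for k in range(n)] for j in range(n)]   # matrix of (6.21)
        step = solve(a, y)                            # step 2
        x_new = [xj + dj for xj, dj in zip(x, step)]
        g_new = g(x_new)
        if abs(g_new - gx) < tol:                     # step 5
            return x_new if g_new < gx else x
        if g_new < gx:                                # step 4: accept
            x, gx, gamma = x_new, g_new, gamma / alpha
        else:                                         # step 3: reject
            gamma *= alpha
    return x

# Hypothetical test problem: fit x_1 * exp(x_2 * t) to exact data with x = (2, -1)
t = [0.0, 0.5, 1.0, 1.5, 2.0]
u = [2.0 * exp(-ti) for ti in t]
model = lambda x: [x[0] * exp(x[1] * ti) for ti in t]
jacobian = lambda x: [[exp(x[1] * ti), x[0] * ti * exp(x[1] * ti)] for ti in t]
x_fit = levenberg_marquardt(model, jacobian, u, [1.0, 0.0])
print(x_fit)
```

For simplicity the sketch recomputes the derivatives also after a rejected step, although only $\gamma$ has changed; an efficient implementation would reuse them.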
Problems


6.1 Prove Brouwer’s fixed point theorem in $\mathbb{R}$; i.e., show that if $D \subset \mathbb{R}$ is a
closed and bounded interval and if $f : D \to D$ is continuous, then $f$ has a (not
necessarily unique) fixed point.


6.2 Draw figures illustrating monotone or alternating divergence of the succes- 
sive iterations for a fixed point of a function of one variable. 


6.3 Show how to solve the equation tan x = x by successive approximations. 


6.4 Show that

\[ \lim_{\nu \to \infty} \underbrace{\sqrt{2 + \sqrt{2 + \cdots + \sqrt{2}}}}_{\nu \text{ square roots}} = 2. \]


6.5 Let $D \subset \mathbb{R}$ be an open interval and let $f : D \to D$ be $m$ times continuously
differentiable. Under the assumption that the sequence $x_{\nu+1} := f(x_\nu)$ converges
to some $x$ in $D$ with $f'(x) = f''(x) = \cdots = f^{(m-1)}(x) = 0$, show that the
convergence is of order $m$.


6.6 Let the sequence $(x_\nu)$ in $\mathbb{R}$ converge to $x$ such that $x_\nu \neq x$ for all $\nu \in \mathbb{N}$
and

\[ x_{\nu+1} - x = (q + \varepsilon_\nu)(x_\nu - x), \quad \nu = 0, 1, \ldots, \]

where $|q| < 1$ and $\varepsilon_\nu \to 0$, $\nu \to \infty$. Show that

\[ y_\nu := x_\nu - \frac{(x_{\nu+1} - x_\nu)^2}{x_{\nu+2} - 2x_{\nu+1} + x_\nu} \]

is well-defined for sufficiently large $\nu$ and that

\[ \lim_{\nu \to \infty} \frac{y_\nu - x}{x_\nu - x} = 0; \]

i.e., the sequence $(y_\nu)$ converges to $x$ more rapidly than the sequence $(x_\nu)$.
This method for speeding up the convergence of sequences is known as Aitken’s
$\delta^2$ method.


6.7 Let $D \subset \mathbb{R}$ be an open interval, let $f : D \to \mathbb{R}$ be twice continuously
differentiable, and let $x$ be a fixed point of $f$ with $f'(x) \neq 1$. Show that Steffensen’s
method

\[ x_{\nu+1} := x_\nu - \frac{[f(x_\nu) - x_\nu]^2}{f[f(x_\nu)] - 2f(x_\nu) + x_\nu}, \quad \nu = 0, 1, \ldots, \]

is locally and quadratically convergent to the fixed point $x$ (see Problem 6.6).


6.8 Discuss Steffensen’s method of Problem 6.7 for the fixed point x = 0 of the 
function f(x) := 22 4 2°. 


6.9 Show that

\[ x_{\nu+1} := \frac{x_\nu (x_\nu^2 + 3a)}{3x_\nu^2 + a}, \quad \nu = 0, 1, \ldots, \]

is a method of order three for computing the square root of a positive number $a$.


6.10 Prove an analogue of Corollary 6.16 for the secant method (6.9).


6.11 Give conditions for monotone convergence of Newton’s method for a func- 
tion of one variable. 


6.12 Show that Newton’s method for the function $f(x) := x^n - a$, $x > 0$, where
$n > 1$ and $a > 0$, converges globally to $a^{1/n}$.


6.13 Assume that the polynomial $p$ with real coefficients has only real zeros
and denote the largest zero by $z_1$. Show that for any initial point $x_0$ with $x_0 > z_1$
Newton’s method converges to $z_1$.


6.14 Assume that $z$ is a zero of order $m$ of the polynomial $p$. Show that

\[ x_{\nu+1} := x_\nu - m\, \frac{p(x_\nu)}{p'(x_\nu)}, \quad \nu = 0, 1, \ldots, \]

converges locally and quadratically to the zero $z$.


6.15 Show that for a nonsingular $n \times n$ matrix $A$ the sequence

\[ A_{\nu+1} := A_\nu [2I - A A_\nu], \quad \nu = 0, 1, \ldots, \]

converges quadratically to the inverse $A^{-1}$, provided that $\|I - A A_0\| < 1$.


6.16 Write a computer program for finding $n$ simple zeros of a polynomial of
degree $n$ with real coefficients. Use this code for the computation of the zeros of
the Laguerre polynomial $L_4(x) = x^4 - 16x^3 + 72x^2 - 96x + 24$.


6.17 Show that for the function $f : (0, \infty) \to \mathbb{R}$ given by

\[ f(x) := \frac{\ln 2}{\pi}\, \sin\left( 2\pi\, \frac{\ln x}{\ln 2} \right) + 1 \]

the Newton iterations starting with $x_0 = 1$ converge and that the limit, however,
is not a zero of $f$.


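6.18 The eigenvalue problem $Ax = \lambda x$ for an $n \times n$ matrix $A$ is equivalent to
the equation $f(x, \lambda) = 0$, where $f : \mathbb{R}^n \times \mathbb{R} \to \mathbb{R}^n \times \mathbb{R}$ is defined by

\[ f : \begin{pmatrix} x \\ \lambda \end{pmatrix} \mapsto \begin{pmatrix} Ax - \lambda x \\ x^T x - 1 \end{pmatrix}. \]

Write down Newton’s method for this equation.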


6.19 Write a computer program for solving a least squares problem by the 
Levenberg—Marquardt method. 


6.20 The set of all points $c \in \mathbb{C}$ for which the fixed point iteration $z_{\nu+1} := z_\nu^2 + c$
starting with $z_0 = 0$ remains bounded is called the Mandelbrot set. Write a
computer program for visualizing the Mandelbrot set.


7


Matrix Eigenvalue Problems 


Many problems in science and engineering lead to eigenvalue problems for 
matrices. These occur either directly or by discretization of eigenvalue prob- 
lems for differential or integral operators. In the latter case the size of the 
matrices will be rather large. It is the purpose of this chapter to intro- 
duce some of the main ideas in matrix eigenvalue computations without 
attempting to be comprehensive. For a more detailed study we refer to
[27, 65].

For the numerical computation of matrix eigenvalues we have to distin- 
guish between two groups of methods: 

1. In the so-called direct methods the eigenvalues are obtained as zeros
of the characteristic polynomial.

2. In contrast, iterative methods approximate the eigenvalues through a 
successive approximation procedure without using the characteristic 
polynomial. 

Since, as illustrated in Example 6.24, the computation of zeros of poly- 
nomials of high degree tends in general to be ill-conditioned, in practice it- 
erative methods are used almost exclusively. In this chapter we will discuss 
the two most important methods of this class, namely the Jacobi method 
and the QR algorithm. In the last section we will also briefly describe the 
Hessenberg method as an example of a direct method. 

A key factor in all eigenvalue computations is the fact that similarity 
transformations leave the eigenvalues of a matrix invariant; i.e., for a given 
matrix A the matrices A and C~!AC have the same eigenvalues for all 
nonsingular matrices $C$. This can be seen either from the equivalence of
the equations

\[ Ax = \lambda x \quad \text{and} \quad (C^{-1}AC)\, C^{-1}x = \lambda\, C^{-1}x \]

or from the multiplication theorem for determinants

\[ \det(\lambda I - A) = \det[C^{-1}(\lambda I - A)C] = \det(\lambda I - C^{-1}AC); \]


i.e., similar matrices have the same characteristic polynomial. This invari- 
ance allows one to transform a given matrix A by a similarity transfor- 
mation into a matrix of simpler form with the same eigenvalues as A. In 
particular, the iterative methods successively construct sequences of similar 
matrices that converge to a diagonal matrix or an upper (or lower) trian- 
gular matrix from which the eigenvalues can be read off as the diagonal 
elements. 


7.1 Examples 


We begin by illustrating how the discretization of eigenvalue problems for 
differential operators leads to eigenvalue problems for large matrices. 


Example 7.1 The vibrations of a string are modeled by the so-called wave
equation

\[ \frac{\partial^2 w}{\partial x^2} = \frac{1}{c^2}\, \frac{\partial^2 w}{\partial t^2}, \]

where $w = w(x,t)$ denotes the vertical elongation and $c$ is the speed of
sound in the string. Assuming that the string is clamped at $x = 0$ and
$x = 1$, the boundary conditions $w(0,t) = w(1,t) = 0$ must be satisfied for
all times $t$. Obviously, the time-harmonic wave

\[ w(x,t) = v(x)\, e^{i\omega t} \]

with frequency $\omega$ solves the wave equation, provided that the space-dependent
part $v$ satisfies

\[ -v'' = \lambda v \quad \text{on } [0,1], \]

where $\lambda := \omega^2/c^2$. The boundary conditions $w(0,t) = w(1,t) = 0$ are
satisfied if $v$ satisfies the boundary conditions

\[ v(0) = v(1) = 0. \]

Hence, introducing the linear space

\[ U := \{ v \in C[0,1] : v \text{ is twice continuously differentiable},\ v(0) = v(1) = 0 \} \]

and defining the differential operator $D : U \to C[0,1]$ by $D : v \mapsto -v''$,
we are led to the eigenvalue problem $Dv = \lambda v$. Elementary calculations
show that the functions $v_m(x) = \sin m\pi x$ are eigenfunctions of $D$ with the
eigenvalues $\lambda_m = m^2\pi^2$ for $m = 1, 2, \ldots$. It can be shown that these are
the only eigenvalues and eigenfunctions of $D$.

For discussing an approximate solution we consider the slightly more
general differential equation

\[ -v'' + pv = \lambda v \quad \text{on } [0,1] \]

with boundary conditions $v(0) = v(1) = 0$, where $p \in C[0,1]$ is a given positive
function. We can proceed as in Example 2.1 and choose an equidistant
mesh $x_j = jh$, $j = 0, \ldots, n+1$, with step size $h = 1/(n+1)$ and $n \in \mathbb{N}$. At
the internal grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient
by the difference quotient

\[ v''(x_j) \approx \frac{1}{h^2}\, \{ v(x_{j+1}) - 2v(x_j) + v(x_{j-1}) \} \]

to obtain the system of equations

\[ -\frac{1}{h^2}\, \{ v_{j-1} - 2v_j + v_{j+1} \} + p_j v_j = \lambda v_j, \quad j = 1, \ldots, n, \]

for approximate values $v_j$ to the exact solution $v(x_j)$. Here, we have set
$p_j := p(x_j)$ for $j = 0, \ldots, n+1$. This system has to be complemented by
the two boundary conditions $v_0 = v_{n+1} = 0$. For an abbreviated notation
we introduce the $n \times n$ tridiagonal matrix

\[ A := \frac{1}{h^2} \begin{pmatrix}
2 + h^2 p_1 & -1 & & & & \\
-1 & 2 + h^2 p_2 & -1 & & & \\
& -1 & 2 + h^2 p_3 & -1 & & \\
& & \ddots & \ddots & \ddots & \\
& & & -1 & 2 + h^2 p_{n-1} & -1 \\
& & & & -1 & 2 + h^2 p_n
\end{pmatrix} \]

and the vector $u = (v_1, \ldots, v_n)^T$. Then the above system of equations,
including the boundary conditions, reads

\[ Au = \lambda u; \]

i.e., the eigenvalue problem for the differential operator $D$ is approximated
by the eigenvalue problem for the matrix $A$. □
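
For the special case $p = 0$ the quality of this approximation can be checked directly, since the grid values of the eigenfunctions $\sin m\pi x$ are then eigenvectors of the matrix $A$ with the eigenvalues $4h^{-2} \sin^2(m\pi h/2)$ (a standard fact about this tridiagonal matrix that is not derived in the text). The following sketch, with function and variable names of our own, applies $A$ to such a vector without storing the matrix and compares the discrete eigenvalue with $m^2\pi^2$.

```python
from math import sin, pi

def apply_tridiagonal(n, p, v):
    """Apply the matrix A of Example 7.1 to v = (v_1, ..., v_n)."""
    h = 1.0 / (n + 1)
    padded = [0.0] + list(v) + [0.0]      # boundary values v_0 = v_{n+1} = 0
    return [(-padded[j - 1] + 2.0 * padded[j] - padded[j + 1]) / h**2
            + p(j * h) * padded[j] for j in range(1, n + 1)]

n, m = 50, 1
h = 1.0 / (n + 1)
v = [sin(m * pi * j * h) for j in range(1, n + 1)]
Av = apply_tridiagonal(n, lambda x: 0.0, v)

discrete_eigenvalue = 4.0 / h**2 * sin(m * pi * h / 2.0) ** 2
residual = max(abs(a - discrete_eigenvalue * vj) for a, vj in zip(Av, v))

print(discrete_eigenvalue, (m * pi) ** 2)   # matrix eigenvalue vs. m^2 pi^2
print(residual)                             # close to rounding level
```

For $n = 50$ the smallest matrix eigenvalue differs from $\pi^2$ only in the third decimal place, in line with the second-order accuracy of the difference quotient.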


The important question as to how well the matrix eigenvalues approx- 
imate the eigenvalues of the differential operator and whether we have 
convergence of the eigenvalues as $h \to 0$ is beyond the scope of this book
(see Problem 7.2). The example is meant only as an illustration of the fact 
that eigenvalue problems for large matrices arise through the discretiza- 
tion of eigenvalue problems for ordinary differential operators and also for 
partial differential operators. In the same spirit, eigenvalue problems for 
integral operators can be approximated by matrix eigenvalue problems, as 
indicated in the following example. 
Example 7.2 Consider the eigenvalue problem

\[ \int_0^1 K(x,y)\, \varphi(y)\, dy = \lambda \varphi(x), \quad x \in [0,1], \]

for a linear integral operator with continuous kernel $K$. For the numerical
approximation we proceed as in Example 2.3 and approximate the integral
by the rectangular rule with equidistant quadrature points $x_k = k/n$ for
$k = 1, \ldots, n$. If we require the approximated equation to be satisfied only
at the grid points, we arrive at the approximating system of equations

\[ \frac{1}{n} \sum_{k=1}^n K(x_j, x_k)\, \varphi_k = \lambda \varphi_j, \quad j = 1, \ldots, n, \]

for approximate values $\varphi_j$ to the exact solution $\varphi(x_j)$. Hence, we approximate
the eigenvalues of the integral operator by the eigenvalues of the
matrix with entries $K(x_j, x_k)/n$. Of course, instead of the rectangular rule
any other quadrature rule can be used. A discussion of the convergence of
the matrix eigenvalues to the eigenvalues of the integral operator is again
beyond the aim of this introduction. □


7.2 Estimates for the Eigenvalues 


At this point we urge the reader to recall the basic facts about eigenvalues 
of matrices, in particular those that were presented in Section 3.4. In the
sequel, by $(\cdot\,, \cdot)$ we denote the Euclidean scalar product in $\mathbb{C}^n$ and by $\|\cdot\|_2$
the corresponding Euclidean norm.

The eigenvalues of Hermitian matrices can be characterized by the fol- 
lowing maximum principles. These can be used to get some rough estimates 
for the eigenvalues. Note that for the eigenvalues of Hermitian matrices the 
geometric and the algebraic multiplicity coincide (see Problem 7.4). 


Theorem 7.3 (Rayleigh) Let $A$ be a Hermitian $n \times n$ matrix with eigenvalues

\[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \]

(where multiple eigenvalues occur according to their multiplicity) and
corresponding orthonormal eigenvectors $x_1, x_2, \ldots, x_n$. Then

\[ \lambda_j = \max_{x \in V_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n, \]

where the subspaces $V_1, \ldots, V_n$ are defined by $V_1 := \mathbb{C}^n$ and

\[ V_j := \{ x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j-1 \}, \quad j = 2, \ldots, n. \]

Proof. Let $x \in V_j$ with $x \neq 0$. Then

\[ x = \sum_{k=j}^n (x, x_k)\, x_k \quad \text{and} \quad \sum_{k=j}^n |(x, x_k)|^2 = (x, x). \]

Hence

\[ Ax = \sum_{k=j}^n \lambda_k (x, x_k)\, x_k \]

and

\[ (Ax, x) = \sum_{k=j}^n \lambda_k |(x, x_k)|^2 \leq \lambda_j \sum_{k=j}^n |(x, x_k)|^2 = \lambda_j (x, x). \]

This implies

\[ \sup_{x \in V_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)} \leq \lambda_j, \]

and the statement follows from $(Ax_j, x_j) = \lambda_j$ and $x_j \in V_j$. □


This maximum principle can be used in a simple manner to obtain lower
bounds for the largest eigenvalue of Hermitian matrices. For the matrix

\[ A = \begin{pmatrix} 1 & 3 & 2 \\ 3 & 5 & 1 \\ 2 & 1 & 4 \end{pmatrix}, \]

by using $x = (1,1,1)^T$ we find the estimate $\lambda_1 \geq 7.33$ as compared to the
exact eigenvalue $\lambda_1 = 7.58\ldots$. Using $x = (1,2,1)^T$ leads to the estimate
$\lambda_1 \geq 7.50$.
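
The two estimates are values of the Rayleigh quotient $(Ax, x)/(x, x)$ and can be reproduced with a few lines of Python. The function name `rayleigh_quotient` is our own, and the sketch is restricted to real vectors and matrices (for complex data the scalar product would require complex conjugation).

```python
def rayleigh_quotient(a, x):
    """Return (Ax, x) / (x, x) for a real square matrix a and a real vector x != 0."""
    ax = [sum(ajk * xk for ajk, xk in zip(row, x)) for row in a]
    return sum(axj * xj for axj, xj in zip(ax, x)) / sum(xj * xj for xj in x)

A = [[1, 3, 2],
     [3, 5, 1],
     [2, 1, 4]]

print(rayleigh_quotient(A, [1, 1, 1]))   # 22/3
print(rayleigh_quotient(A, [1, 2, 1]))   # 45/6
```

Any nonzero vector yields a lower bound for $\lambda_1$ in this way, and the bound improves as the vector gets closer to an eigenvector for $\lambda_1$.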

Using Rayleigh’s principle to obtain bounds for the smaller eigenvalues
requires the knowledge of the eigenvectors for the preceding larger eigenvalues.
This problem is circumvented in the following minimum maximum
principle.


Theorem 7.4 (Courant) Let $A$ be a Hermitian $n \times n$ matrix with eigenvalues

\[ \lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n \]

(where multiple eigenvalues occur according to their multiplicity). Then

\[ \lambda_j = \min_{U_j \in M_j}\ \max_{x \in U_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)}, \quad j = 1, \ldots, n, \]

where $M_j$ denotes the set of all subspaces $U_j \subset \mathbb{C}^n$ of dimension $n + 1 - j$.

Proof. First we note that because of

\[ \sup_{x \in U_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)} = \sup_{x \in U_j,\ (x,x) = 1} (Ax, x) \]

and the continuity of the function $x \mapsto (Ax, x)$, the supremum is attained;
i.e., the maximum exists.

By $x_1, x_2, \ldots, x_n$ we denote orthonormal eigenvectors corresponding to
the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$. First, we show that for a given
subspace $U_j$ of dimension $n + 1 - j$ there exists a vector $x \in U_j$ such that

\[ (x, x_k) = 0, \quad k = j+1, \ldots, n. \tag{7.1} \]

Let $z_1, \ldots, z_{n+1-j}$ be a basis of $U_j$. Then we can represent each $x \in U_j$ by

\[ x = \sum_{i=1}^{n+1-j} \alpha_i z_i. \tag{7.2} \]

In order to guarantee (7.1), the $n + 1 - j$ coefficients $\alpha_1, \ldots, \alpha_{n+1-j}$ must
satisfy the $n - j$ linear equations

\[ \sum_{i=1}^{n+1-j} \alpha_i (z_i, x_k) = 0, \quad k = j+1, \ldots, n. \]

This underdetermined system always has a nontrivial solution. For the
corresponding $x$ given by (7.2) we have $x \neq 0$, and from

\[ x = \sum_{k=1}^j (x, x_k)\, x_k \]

we obtain that

\[ (Ax, x) = \sum_{k=1}^j \lambda_k |(x, x_k)|^2 \geq \lambda_j \sum_{k=1}^j |(x, x_k)|^2 = \lambda_j (x, x), \]

whence

\[ \max_{x \in U_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)} \geq \lambda_j \]

follows.

On the other hand, for the subspace

\[ U_j = \{ x \in \mathbb{C}^n : (x, x_k) = 0,\ k = 1, \ldots, j-1 \} \]

of dimension $n + 1 - j$, by Theorem 7.3 we have the equality

\[ \max_{x \in U_j,\ x \neq 0} \frac{(Ax, x)}{(x, x)} = \lambda_j, \]

and the proof is finished. □
Corollary 7.5 Let $A$ and $B$ be two Hermitian $n \times n$ matrices with eigenvalues
$\lambda_1(A) \geq \lambda_2(A) \geq \cdots \geq \lambda_n(A)$ and $\lambda_1(B) \geq \lambda_2(B) \geq \cdots \geq \lambda_n(B)$.
Then

\[ |\lambda_j(A) - \lambda_j(B)| \leq \|A - B\|, \quad j = 1, \ldots, n, \]

for any norm $\|\cdot\|$ on $\mathbb{C}^n$.

Proof. From the Cauchy-Schwarz inequality we have that

\[ (Ax - Bx, x) \leq \|(A - B)x\|_2\, \|x\|_2 \leq \|A - B\|_2\, \|x\|_2^2 \]

and hence

\[ (Ax, x) \leq (Bx, x) + \|A - B\|_2\, \|x\|_2^2. \]

By the Courant minimum maximum principle of Theorem 7.4 this implies

\[ \lambda_j(A) \leq \lambda_j(B) + \|A - B\|_2, \quad j = 1, \ldots, n. \]

Interchanging the roles of $A$ and $B$, we also have that

\[ \lambda_j(B) \leq \lambda_j(A) + \|B - A\|_2, \quad j = 1, \ldots, n, \]

and therefore

\[ |\lambda_j(A) - \lambda_j(B)| \leq \|A - B\|_2, \quad j = 1, \ldots, n. \]

Now the statement follows from

\[ \|A - B\|_2 = \rho(A - B) \leq \|A - B\|, \]

which is a consequence of Theorems 3.31 and 3.32. □


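Corollary 7.6 For the eigenvalues $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_n$ of a Hermitian
$n \times n$ matrix $A = (a_{jk})$ we have that

\[ |\lambda_i - a_{ii}'|^2 \leq \sum_{j,k=1,\ j \neq k}^n |a_{jk}|^2, \quad i = 1, \ldots, n, \]

where the elements $a_{11}', \ldots, a_{nn}'$ represent a permutation of the diagonal
elements $a_{11}, \ldots, a_{nn}$ of $A$ such that $a_{11}' \geq a_{22}' \geq \cdots \geq a_{nn}'$.

Proof. Use $B = \operatorname{diag}(a_{jj})$ and $\|\cdot\| = \|\cdot\|_2$ in the preceding corollary. □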


We conclude this section with an extension of the above results to general
matrices that gives a rough estimate as to where in $\mathbb{C}$ the eigenvalues are
located.

Theorem 7.7 (Gerschgorin) Let $A = (a_{jk})$ be a complex $n \times n$ matrix
and define the disks

\[ G_j := \Bigl\{ \lambda \in \mathbb{C} : |\lambda - a_{jj}| \leq \sum_{k=1,\ k \neq j}^n |a_{jk}| \Bigr\}, \quad j = 1, \ldots, n, \]

and

\[ G_j' := \Bigl\{ \lambda \in \mathbb{C} : |\lambda - a_{jj}| \leq \sum_{k=1,\ k \neq j}^n |a_{kj}| \Bigr\}, \quad j = 1, \ldots, n. \]

Then the eigenvalues $\lambda$ of $A$ satisfy

\[ \lambda \in \bigcup_{j=1}^n G_j \cap \bigcup_{j=1}^n G_j'. \]

Proof. Assume that $Ax = \lambda x$ and $\|x\|_\infty = 1$, and for $x = (x_1, \ldots, x_n)^T$
choose $j$ such that $|x_j| = \|x\|_\infty = 1$. Then

\[ |\lambda - a_{jj}| = |(\lambda - a_{jj})\, x_j| = \Bigl| \sum_{k=1,\ k \neq j}^n a_{jk} x_k \Bigr| \leq \sum_{k=1,\ k \neq j}^n |a_{jk}|, \]

and therefore

\[ \lambda \in \bigcup_{j=1}^n G_j. \]

Since the eigenvalues of $A^*$ are the complex conjugates of the eigenvalues of
$A$ (see Problem 7.3), we also have that

\[ \lambda \in \bigcup_{j=1}^n G_j', \]

and the theorem is proven. □
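
The disks of Theorem 7.7 are cheap to compute. The following sketch uses our own function names and represents each disk as a pair (center, radius); it returns the row disks $G_j$ and the column disks $G_j'$ and checks whether a given number lies in the union of a family of disks. As test matrix we take the symmetric tridiagonal matrix with diagonal entries 2 and off-diagonal entries $-1$, whose eigenvalues are $2 + \sqrt{2}$, $2$, and $2 - \sqrt{2}$.

```python
from math import sqrt

def gerschgorin_disks(a):
    """Return the row disks G_j and the column disks G_j' as (center, radius) pairs."""
    n = len(a)
    rows = [(a[j][j], sum(abs(a[j][k]) for k in range(n) if k != j)) for j in range(n)]
    cols = [(a[j][j], sum(abs(a[k][j]) for k in range(n) if k != j)) for j in range(n)]
    return rows, cols

def in_union(lam, disks):
    """Check whether lam lies in at least one of the closed disks."""
    return any(abs(lam - center) <= radius for center, radius in disks)

A = [[2, -1, 0],
     [-1, 2, -1],
     [0, -1, 2]]
rows, cols = gerschgorin_disks(A)
eigenvalues = (2 + sqrt(2), 2, 2 - sqrt(2))

print(rows)
print(all(in_union(lam, rows) and in_union(lam, cols) for lam in eigenvalues))
```

Here all three disks are centered at 2 with radii 1, 2, and 1, so that the theorem only yields $|\lambda - 2| \leq 2$ for each eigenvalue, which illustrates how rough the estimate can be.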


7.3 The Jacobi Method


The method described in this section was discovered by Jacobi in 1846 and 
can be used to iteratively compute all the eigenvalues and eigenvectors of 
real symmetric matrices. 
Lemma 7.8 The Frobenius norm

\[ \|A\|_F := \Bigl( \sum_{j,k=1}^n |a_{jk}|^2 \Bigr)^{1/2} \]

of an $n \times n$ matrix $A = (a_{jk})$ is invariant with respect to unitary
transformations.

Proof. The trace

\[ \operatorname{tr} A := \sum_{j=1}^n a_{jj} \]

of a matrix $A$ is commutative; i.e., $\operatorname{tr} AB = \operatorname{tr} BA$. This follows from

\[ \sum_{j=1}^n (AB)_{jj} = \sum_{j=1}^n \sum_{k=1}^n a_{jk} b_{kj} = \sum_{k=1}^n \sum_{j=1}^n b_{kj} a_{jk} = \sum_{k=1}^n (BA)_{kk}. \]

In particular, we have that

\[ \operatorname{tr} AA^* = \sum_{j=1}^n \sum_{k=1}^n a_{jk} \overline{a_{jk}} = \sum_{j=1}^n \sum_{k=1}^n |a_{jk}|^2 = \|A\|_F^2. \]

Therefore, for each unitary matrix $Q$ it follows that

\[ \|Q^*AQ\|_F^2 = \operatorname{tr}(Q^*AQQ^*A^*Q) = \operatorname{tr}(Q^*AA^*Q) = \operatorname{tr}(AA^*QQ^*) = \|A\|_F^2, \]

and the lemma is proven. □


Corollary 7.9 The eigenvalues of an $n \times n$ matrix $A$ (counted repeatedly
according to their algebraic multiplicity) satisfy Schur’s inequality

\[ \sum_{j=1}^n |\lambda_j|^2 \leq \|A\|_F^2. \]

Equality holds if and only if the matrix $A$ is normal, i.e., if $AA^* = A^*A$.

Proof. By Theorem 3.27 there exists a unitary matrix $Q$ such that
$R := Q^*AQ$ is an upper triangular matrix. Hence

\[ \|A\|_F^2 = \|R\|_F^2 = \sum_{j=1}^n |\lambda_j|^2 + \sum_{j=1}^{n-1} \sum_{k=j+1}^n |r_{jk}|^2, \tag{7.3} \]

since the diagonal elements of $R = (r_{jk})$ coincide with the eigenvalues of
the similar matrices $R$ and $A$. Now Schur’s inequality follows immediately
from (7.3).

For the discussion of the case of equality, we first note that any unitary
transformation of a normal matrix is again normal. This is a consequence
of the identity

\[ Q^*AQ\, (Q^*AQ)^* - (Q^*AQ)^*\, Q^*AQ = Q^*(AA^* - A^*A)Q. \]

If equality holds in Schur’s inequality, then (7.3) implies that $R$ is a diagonal
matrix. Hence $R$, and therefore $A$, is normal.

Conversely, if $A$ is normal, then the upper triangular matrix $R$ must also
be normal. Now, from

\[ (RR^*)_{jj} = \sum_{k=1}^n r_{jk} \overline{r_{jk}} = \sum_{k=j}^n |r_{jk}|^2 \]

and

\[ (R^*R)_{jj} = \sum_{k=1}^n \overline{r_{kj}}\, r_{kj} = \sum_{k=1}^j |r_{kj}|^2 \]

we conclude that

\[ \sum_{k=j}^n |r_{jk}|^2 = \sum_{k=1}^j |r_{kj}|^2, \quad j = 1, \ldots, n. \]

This implies $r_{jk} = 0$ for $j < k$, i.e., $R$ is a diagonal matrix, and from (7.3)
we deduce that equality holds in Schur’s inequality if $A$ is normal. □

For any $n \times n$ matrix $A = (a_{jk})$ we introduce the quantity

\[ N(A) := \Bigl( \sum_{j,k=1,\ j \neq k}^n |a_{jk}|^2 \Bigr)^{1/2} \tag{7.4} \]

as a measure for the deviation of $A$ from a diagonal matrix.

Lemma 7.10 Normal matrices $A$ satisfy

\[ \sum_{j=1}^n |\lambda_j|^2 = \sum_{j=1}^n |a_{jj}|^2 + [N(A)]^2. \]

Proof. This follows from Corollary 7.9. □

The main idea of the Jacobi method for real symmetric matrices is to
successively reduce $N(A)$ by elementary plane rotation matrices such that
in the limit the matrix becomes diagonal (with the eigenvalues as diagonal
entries).
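Lemma 7.11 For each pair $j < k$ and each $\varphi \in \mathbb{R}$ the matrix

\[ U = \begin{pmatrix}
1 & & & & & & \\
& \ddots & & & & & \\
& & \cos\varphi & \cdots & -\sin\varphi & & \\
& & \vdots & \ddots & \vdots & & \\
& & \sin\varphi & \cdots & \cos\varphi & & \\
& & & & & \ddots & \\
& & & & & & 1
\end{pmatrix}, \]

which coincides with the identity matrix except for $u_{jj} = u_{kk} = \cos\varphi$ and
$u_{kj} = -u_{jk} = \sin\varphi$ (and which describes a rotation in the $x_j x_k$-plane), is
unitary.

Proof. This follows from

\[ \begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix} \]

and

\[ \begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}. \]
□

Lemma 7.12 Let $A$ be a real symmetric matrix and let $U$ be the unitary
matrix of Lemma 7.11. Then $B = U^*AU$ is also real and symmetric and
has the entries

\[ b_{jj} = a_{jj} \cos^2\varphi + a_{jk} \sin 2\varphi + a_{kk} \sin^2\varphi, \]

\[ b_{kk} = a_{jj} \sin^2\varphi - a_{jk} \sin 2\varphi + a_{kk} \cos^2\varphi, \]

\[ b_{jk} = b_{kj} = a_{jk} \cos 2\varphi + \frac{1}{2}\, (a_{kk} - a_{jj}) \sin 2\varphi, \]

and

\[ b_{ij} = b_{ji} = a_{ij} \cos\varphi + a_{ik} \sin\varphi, \quad i \neq j, k, \]

\[ b_{ik} = b_{ki} = -a_{ij} \sin\varphi + a_{ik} \cos\varphi, \quad i \neq j, k, \]

\[ b_{il} = a_{il}, \quad i, l \neq j, k; \]

i.e., the matrix $B$ differs from $A$ only in the $j$th and $k$th rows and columns.

Proof. The matrix $B$ is real, since $A$ and $U$ are real, and it is symmetric,
since the unitary transformation of a Hermitian matrix is again Hermitian.
Elementary calculations show that

\[ \begin{pmatrix} \cos\varphi & \sin\varphi \\ -\sin\varphi & \cos\varphi \end{pmatrix}
\begin{pmatrix} a_{jj} & a_{jk} \\ a_{kj} & a_{kk} \end{pmatrix}
\begin{pmatrix} \cos\varphi & -\sin\varphi \\ \sin\varphi & \cos\varphi \end{pmatrix}
= \begin{pmatrix} b_{jj} & b_{jk} \\ b_{kj} & b_{kk} \end{pmatrix} \]

with $b_{jj}$, $b_{jk}$, $b_{kj}$, and $b_{kk}$ as stated in the lemma. For $i \neq j, k$ we have
that

\[ b_{ij} = \sum_{r,s=1}^n u_{is}^* a_{sr} u_{rj} = a_{ij} u_{jj} + a_{ik} u_{kj} = a_{ij} \cos\varphi + a_{ik} \sin\varphi \]

and

\[ b_{ik} = \sum_{r,s=1}^n u_{is}^* a_{sr} u_{rk} = a_{ij} u_{jk} + a_{ik} u_{kk} = -a_{ij} \sin\varphi + a_{ik} \cos\varphi. \]

Finally, we have

\[ b_{il} = \sum_{r,s=1}^n u_{is}^* a_{sr} u_{rl} = a_{il} \]

for $i, l \neq j, k$. □

Lemma 7.13 For

\[ \tan 2\varphi = \frac{2a_{jk}}{a_{jj} - a_{kk}}, \quad a_{jj} \neq a_{kk}, \]

\[ \varphi = \frac{\pi}{4}, \quad a_{jj} = a_{kk}, \]

the transformation of Lemma 7.12 annihilates the elements

\[ b_{jk} = b_{kj} = 0 \]

and reduces the off-diagonal elements according to

\[ [N(B)]^2 = [N(A)]^2 - 2a_{jk}^2. \]

Proof. $b_{jk} = b_{kj} = 0$ follows immediately from Lemma 7.12. Applying
Lemma 7.8 to the matrices

\[ \begin{pmatrix} a_{jj} & a_{jk} \\ a_{kj} & a_{kk} \end{pmatrix} \quad \text{and} \quad
\begin{pmatrix} b_{jj} & b_{jk} \\ b_{kj} & b_{kk} \end{pmatrix} \]

yields

\[ a_{jj}^2 + 2a_{jk}^2 + a_{kk}^2 = b_{jj}^2 + b_{kk}^2. \]

From this, with the aid of Lemmas 7.8 and 7.12 we find that

\[ [N(B)]^2 = \|B\|_F^2 - \sum_{i=1}^n b_{ii}^2 = \|A\|_F^2 - \sum_{i=1}^n b_{ii}^2
= [N(A)]^2 + \sum_{i=1}^n (a_{ii}^2 - b_{ii}^2) = [N(A)]^2 - 2a_{jk}^2, \]

which completes the proof. □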
Note that the quantities required for the computation of the elements of
the transformed matrix can be obtained by the trigonometric identities

\[ \cos 2\varphi = \frac{1}{\sqrt{1 + \tan^2 2\varphi}}, \quad
\cos\varphi = \sqrt{\tfrac{1}{2}\, (1 + \cos 2\varphi)}, \quad
\sin\varphi = \pm\sqrt{\tfrac{1}{2}\, (1 - \cos 2\varphi)}. \]

The sign of the root in the expression for $\sin\varphi$ has to be chosen such that
it coincides with the sign of $\tan 2\varphi$.

The classical Jacobi method generates a sequence $(A_\nu)$ of similar matrices
by starting with the given matrix $A_0 := A$ and choosing the unitary
transformation at the $\nu$th step according to Lemma 7.13 such that the
nondiagonal element of $A_{\nu-1}$ with largest absolute value is annihilated. It is
obvious that the elements annihilated in one step of the Jacobi iteration,
in general, do not remain zero during subsequent steps. However, we can
establish the following convergence result.


Theorem 7.14 The classical Jacobi method converges; i.e., the sequence
$(A_\nu)$ converges to a diagonal matrix with the eigenvalues of $A$ as diagonal
elements.

Proof. For one step of the Jacobi method, from

\[ [N(A)]^2 \leq (n^2 - n) \max_{j \neq k} a_{jk}^2 \]

we obtain that

\[ a_{jk}^2 \geq \frac{[N(A)]^2}{n(n-1)} \]

for the nondiagonal element $a_{jk}$ with largest modulus. Hence, from Lemma
7.13 we deduce that

\[ [N(B)]^2 = [N(A)]^2 - 2a_{jk}^2 \leq q^2\, [N(A)]^2, \]

where

\[ q := \left( 1 - \frac{2}{n(n-1)} \right)^{1/2}. \]

For the sequence $(A_\nu)$ this implies that

\[ N(A_\nu) \leq q^\nu N(A_0) \]

for all $\nu \in \mathbb{N}$, whence $N(A_\nu) \to 0$, $\nu \to \infty$, since $q < 1$. □
Note that for large $n$ the value of $q$ is close to one, indicating a slow
convergence of the Jacobi method. Writing $A_\nu = (a_{jk,\nu})$, by Corollary 7.6
we have the a posteriori error estimate

\[ |\lambda_j - a_{jj,\nu}| \leq N(A_\nu), \quad j = 1, \ldots, n, \]

after performing $\nu$ steps of the Jacobi method, provided that the diagonal
elements are ordered as in Corollary 7.6. Further error estimates can
be derived from Gerschgorin’s Theorem 7.7.

Approximations to the eigenvectors can be obtained by successively multiplying
the unitary transformations of each step. We have $A_\nu = Q_\nu^* A Q_\nu$,
where $Q_\nu = U_1 \cdots U_\nu$ is the product of the elementary unitary transformations
for each step. From

\[ A_\nu \approx D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n) \]

it follows that $AQ_\nu \approx Q_\nu D$. Hence the columns $Q_\nu = (u_1, \ldots, u_n)$ of $Q_\nu$
satisfy $Au_j \approx \lambda_j u_j$ for $j = 1, \ldots, n$; i.e., they provide approximations to
the eigenvectors.

In each step, the classical Jacobi method requires the determination of 
the nondiagonal element with largest modulus. In order to reduce the com- 
putational costs, in the cyclic Jacobi method the nondiagonal elements are 
annihilated in the order 


(1,2),...,(1,n), (2,3),..., (2, n), (3, 4),...,(n —1,n) 


independent of their size. Convergence results can also be established for 
this variant (see [27]). 

A further refinement is to choose a constant threshold and to annihilate 
in each cyclic sweep only those off-diagonal elements that are larger in 
absolute value than the threshold. Of course, the threshold needs to be 
lowered after each sweep, i.e., after performing a full cycle. For details we 
refer to [48, 65]. 


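Example 7.15 For the matrix

\[ A = \begin{pmatrix} 2 & -1 & 0 \\ -1 & 2 & -1 \\ 0 & -1 & 2 \end{pmatrix} \]

the first six transformed matrices for the classical Jacobi method are given
by

\[ A_1 = \begin{pmatrix} 1.0000 & 0.0000 & -0.7071 \\ 0.0000 & 3.0000 & -0.7071 \\ -0.7071 & -0.7071 & 2.0000 \end{pmatrix}, \]

\[ A_2 = \begin{pmatrix} 0.6340 & -0.3251 & 0.0000 \\ -0.3251 & 3.0000 & -0.6280 \\ 0.0000 & -0.6280 & 2.3660 \end{pmatrix}, \]

\[ A_3 = \begin{pmatrix} 0.6340 & -0.2768 & -0.1704 \\ -0.2768 & 3.3864 & 0.0000 \\ -0.1704 & 0.0000 & 1.9796 \end{pmatrix}, \]

\[ A_4 = \begin{pmatrix} 0.6064 & 0.0000 & -0.1695 \\ 0.0000 & 3.4140 & 0.0169 \\ -0.1695 & 0.0169 & 1.9796 \end{pmatrix}, \]

\[ A_5 = \begin{pmatrix} 0.5858 & 0.0020 & 0.0000 \\ 0.0020 & 3.4140 & 0.0168 \\ 0.0000 & 0.0168 & 2.0002 \end{pmatrix}, \]

\[ A_6 = \begin{pmatrix} 0.5858 & 0.0020 & -0.0000 \\ 0.0020 & 3.4142 & 0.0000 \\ -0.0000 & 0.0000 & 2.0000 \end{pmatrix}. \]

The exact eigenvalues of $A$ are $\lambda_1 = 2 + \sqrt{2}$, $\lambda_2 = 2$, $\lambda_3 = 2 - \sqrt{2}$. □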
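
The classical Jacobi method can be sketched in plain Python as follows. The function name, the stopping rule $N(A_\nu) < \mathrm{tol}$, and the step limit are our own assumptions. The rotation angle is chosen according to Lemma 7.13, with the arctangent selecting $2\varphi$ between $-\pi/2$ and $\pi/2$, and the transformation $B = U^*AU$ is carried out by first updating the columns $j$, $k$ and then the rows $j$, $k$. The matrix is assumed to be real symmetric with $n \geq 2$.

```python
from math import atan, cos, sin, pi, sqrt

def jacobi_eigenvalues(a, tol=1e-12, max_steps=1000):
    """Classical Jacobi method for a real symmetric matrix given as nested lists."""
    a = [[float(entry) for entry in row] for row in a]
    n = len(a)
    pairs = [(r, c) for r in range(n) for c in range(r + 1, n)]
    for _ in range(max_steps):
        # N(A): root of the sum of squares of all off-diagonal elements
        if sqrt(2.0 * sum(a[r][c] ** 2 for r, c in pairs)) < tol:
            break
        # nondiagonal element of largest modulus
        j, k = max(pairs, key=lambda rc: abs(a[rc[0]][rc[1]]))
        if a[j][j] != a[k][k]:
            phi = 0.5 * atan(2.0 * a[j][k] / (a[j][j] - a[k][k]))
        else:
            phi = pi / 4.0
        c, s = cos(phi), sin(phi)
        for i in range(n):                 # columns j and k of A U
            aij, aik = a[i][j], a[i][k]
            a[i][j] = aij * c + aik * s
            a[i][k] = -aij * s + aik * c
        for i in range(n):                 # rows j and k of U^T (A U)
            aji, aki = a[j][i], a[k][i]
            a[j][i] = aji * c + aki * s
            a[k][i] = -aji * s + aki * c
    return sorted((a[i][i] for i in range(n)), reverse=True)

A = [[2, -1, 0],
     [-1, 2, -1],
     [0, -1, 2]]
print(jacobi_eigenvalues(A))
```

For the matrix of Example 7.15 the returned diagonal elements agree with $2 + \sqrt{2}$, $2$, and $2 - \sqrt{2}$ up to rounding errors. Intermediate matrices may differ from those listed in the example in the signs of individual off-diagonal elements, depending on how ties for the largest element are resolved.
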
The QR algorithm was suggested by Francis in 1961 and is an iterative
method for computing all eigenvalues and eigenvectors for arbitrary com- 
plex matrices. In applications, it is the most commonly used method for 
eigenvalue computations. Our presentation of the QR algorithm follows 
[62]. 

For motivation we first consider the power method introduced by von 
Mises in 1929 for finding the eigenvalue with largest modulus. 


Definition 7.16 A matrix $A$ is called diagonalizable if there exists a
nonsingular matrix $C$ such that $C^{-1}AC$ is a diagonal matrix; i.e., $A$ is similar
to a diagonal matrix.


Theorem 7.17 An $n \times n$ matrix $A$ is diagonalizable if and only if it has
$n$ linearly independent eigenvectors.


Proof. Assume that $C^{-1}AC = D$, where $D = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$, is diagonal.
Then $De_j = \lambda_j e_j$, $j = 1,\ldots,n$, with the canonical orthonormal basis
$e_1,\ldots,e_n$ of $\mathbb{C}^n$. This implies that the vectors $x_j := Ce_j$, $j = 1,\ldots,n$, are
eigenvectors of $A$, since

$$Ax_j = ACe_j = CDe_j = C\lambda_j e_j = \lambda_j x_j.$$

The vectors $x_1,\ldots,x_n$ are linearly independent because $C$ is nonsingular
and the $e_1,\ldots,e_n$ are linearly independent.

Conversely, assume that $x_1,\ldots,x_n$ are $n$ linearly independent eigenvectors
of $A$ for the eigenvalues $\lambda_1,\ldots,\lambda_n$. Then the matrix $C = (x_1,\ldots,x_n)$
formed by the eigenvectors as columns is nonsingular, and we have that

$$AC = (Ax_1,\ldots,Ax_n) = (\lambda_1 x_1,\ldots,\lambda_n x_n) = CD,$$

where $D = \operatorname{diag}(\lambda_1,\ldots,\lambda_n)$. Hence $C^{-1}AC = D$. □


We order the eigenvalues of a diagonalizable $n \times n$ matrix $A$ according
to their absolute values and assume that

$$|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|;$$

i.e., there is only one eigenvalue of maximal modulus. Starting from an
arbitrary vector $v_0 \in \mathbb{C}^n$ we construct the sequence

$$v_\nu = A^\nu v_0, \quad \nu = 1, 2, \ldots,$$

by the successive iterations $v_\nu := Av_{\nu-1}$. Note that in order to avoid
numerical overflow or underflow we need to scale after each step. Since the
$n$ linearly independent eigenvectors $x_1,\ldots,x_n$ of $A$ form a basis of $\mathbb{C}^n$, we
can represent

$$v_0 = \sum_{k=1}^n \alpha_k x_k,$$

whence

$$A^\nu v_0 = \sum_{k=1}^n \alpha_k \lambda_k^\nu x_k$$

follows. Scaling after each step by the factor $1/\lambda_1$ leads to

$$\frac{A^\nu v_0}{\lambda_1^\nu} = \sum_{k=1}^n \alpha_k \left(\frac{\lambda_k}{\lambda_1}\right)^{\nu} x_k,$$

and consequently

$$\frac{A^\nu v_0}{\lambda_1^\nu} \to \alpha_1 x_1, \qquad \frac{\|v_{\nu+1}\|_2}{\|v_\nu\|_2} \to |\lambda_1|$$

as $\nu \to \infty$, provided that $\alpha_1 \neq 0$. Of course, in principle, $\lambda_1$ cannot be
used as a scaling factor, since it is not known. However, this is irrelevant,
since the eigenvector is determined only up to multiplication by a complex
constant; i.e., only the direction of the eigenvector is relevant. In practical
computations, the condition $\alpha_1 \neq 0$, i.e., $v_0 \notin \operatorname{span}\{x_2,\ldots,x_n\}$, will be
automatically satisfied through roundoff errors.

The fact that we need to find only the direction of the eigenvectors
motivates us to interpret the power method as a successive iteration of
subspaces. For

$$S := \operatorname{span}\{v_0\} \quad\text{and}\quad A^\nu S = \operatorname{span}\{A^\nu v_0\}$$

from the above we have that $A^\nu S \to \operatorname{span}\{x_1\}$, $\nu \to \infty$. More generally,
we can choose any subspace $S$ of dimension $1 \le \dim S \le n$ and iterate
$A^\nu S = \{A^\nu u : u \in S\}$.
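
The power method with scaling after each step can be sketched as follows. This is a minimal illustration for real matrices stored as lists of rows; the function name `power_method` and the choice of scaling by the Euclidean norm (rather than by the unknown $\lambda_1$) are assumptions of this sketch. The norm of $Av$ for a normalized iterate $v$ serves as the estimate for $|\lambda_1|$.

```python
import math

def power_method(A, v0, steps):
    """Return (estimate of |lambda_1|, normalized iterate) after `steps` iterations."""
    n = len(A)
    norm = math.sqrt(sum(x * x for x in v0))
    v = [x / norm for x in v0]
    estimate = 0.0
    for _ in range(steps):
        w = [sum(A[i][k] * v[k] for k in range(n)) for i in range(n)]  # w = A v
        estimate = math.sqrt(sum(x * x for x in w))  # ||A v||_2 with ||v||_2 = 1
        v = [x / estimate for x in w]                # rescale to avoid overflow
    return estimate, v

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
lam, v = power_method(A, [1.0, 0.0, 0.0], 100)
```

For this matrix the dominant eigenvalue is $2 + \sqrt{2}$, and the iterate approaches the direction of the eigenvector $(1, -\sqrt{2}, 1)^T$.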


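Lemma 7.18 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues

$$|\lambda_1| \ge |\lambda_2| \ge \cdots \ge |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$. Assume that for some $m$
with $1 \le m < n$ we have that $|\lambda_m| > |\lambda_{m+1}|$ and define

$$T := \operatorname{span}\{x_1,\ldots,x_m\} \quad\text{and}\quad U := \operatorname{span}\{x_{m+1},\ldots,x_n\}.$$

Further, assume that $S$ is a subspace of $\mathbb{C}^n$ with dimension $m$ satisfying

$$S \cap U = \{0\}.$$

Then the orthogonal projections $P_{A^\nu S}$ and $P_T$ of $\mathbb{C}^n$ onto $A^\nu S$ and $T$,
respectively, satisfy

$$\|P_{A^\nu S} - P_T\|_2 \le M \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad \nu \in \mathbb{N},$$

for some constant $M$; i.e., the subspaces $A^\nu S$ converge to $T$.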


Proof. 1. First, we show that we can choose a convenient basis for $S$.
Let $y_1,\ldots,y_m$ denote a given basis of $S$. Then, for $j = 1,\ldots,m$, we can
represent

$$y_j = \sum_{k=1}^m b_{jk} x_k + v_j, \tag{7.5}$$

where $v_j \in U$. We prove that the $m \times m$ matrix $B = (b_{jk})$ is nonsingular.
To accomplish this, assume that $\alpha_1,\ldots,\alpha_m$ solve the homogeneous adjoint
system

$$\sum_{j=1}^m b_{jk}\alpha_j = 0, \quad k = 1,\ldots,m.$$

Then from (7.5) it follows that

$$\sum_{j=1}^m \alpha_j y_j = \sum_{j=1}^m \alpha_j v_j,$$

and from this, with the aid of $S \cap U = \{0\}$ and the linear independence of the
$y_j$, we conclude that $\alpha_1 = \cdots = \alpha_m = 0$. Hence, $B$ indeed is nonsingular.
We denote the entries of the inverse of $B$ by $B^{-1} = (c_{jk})$. Then

$$z_j = \sum_{k=1}^m c_{jk} y_k, \quad j = 1,\ldots,m,$$

defines a new basis for $S$ of the form

$$z_j = x_j + u_j,$$

where $u_j \in U$ for $j = 1,\ldots,m$. Because of

$$A^\nu z_j = \lambda_j^\nu x_j + A^\nu u_j, \quad j = 1,\ldots,m,$$

the linearly independent vectors

$$w_{j\nu} := \frac{A^\nu z_j}{\lambda_j^\nu} = x_j + \frac{A^\nu u_j}{\lambda_j^\nu}$$

form a basis of $A^\nu S$. Since we can represent any $u \in U$ in the form

$$u = \sum_{k=m+1}^n \alpha_k x_k,$$

from

$$A^\nu u = \sum_{k=m+1}^n \alpha_k \lambda_k^\nu x_k$$

we conclude that there exists a constant $L > 0$ such that

$$\|w_{j\nu} - x_j\|_2 \le L \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad j = 1,\ldots,m, \tag{7.6}$$

for all $\nu \in \mathbb{N}$.

2. By Corollary 3.53, the orthogonal projection of an element $\eta \in \mathbb{C}^n$ onto
the subspace $T$ is given by

$$P_T\eta = \sum_{k=1}^m \alpha_k x_k, \tag{7.7}$$

where the coefficients $\alpha_1,\ldots,\alpha_m$ solve the normal equations

$$\sum_{k=1}^m \alpha_k (x_k, x_j) = (\eta, x_j), \quad j = 1,\ldots,m. \tag{7.8}$$

Analogously, we have

$$P_{A^\nu S}\,\eta = \sum_{k=1}^m \beta_{k\nu} w_{k\nu} \tag{7.9}$$

and

$$\sum_{k=1}^m \beta_{k\nu} (w_{k\nu}, w_{j\nu}) = (\eta, w_{j\nu}), \quad j = 1,\ldots,m. \tag{7.10}$$

We denote the $m \times m$ matrices of the linear systems (7.8) and (7.10) by $X$
and $W_\nu$, respectively. Then, with the aid of the Cauchy–Schwarz inequality,
(7.6) implies that

$$\|W_\nu - X\|_2 \le C_1 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu}, \quad \nu \in \mathbb{N}, \tag{7.11}$$

for some constant $C_1$. We denote the right-hand sides of (7.8) and (7.10) by
$a$ and $b_\nu$, respectively. Again from (7.6) and the Cauchy–Schwarz inequality
we have that

$$\|b_\nu - a\|_2 \le C_2 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu} \|\eta\|_2, \quad \nu \in \mathbb{N}, \tag{7.12}$$

for some constant $C_2$. Now, considering the linear system (7.10) as a
perturbation of (7.8), from Theorem 5.3 we can conclude that

$$\|\beta_\nu - \alpha\|_2 \le C_3 \left|\frac{\lambda_{m+1}}{\lambda_m}\right|^{\nu} \|\eta\|_2, \quad \nu \in \mathbb{N}, \tag{7.13}$$

for the vectors $\alpha = (\alpha_1,\ldots,\alpha_m)^T$ and $\beta_\nu = (\beta_{1\nu},\ldots,\beta_{m\nu})^T$ and some
constant $C_3$. From (7.7) and (7.9), using (7.6), (7.13), and the triangle
inequality, the assertion of the lemma follows. □


The subspace $T$ of Lemma 7.18 is invariant with respect to $A$; i.e.,
$A(T) = T$. Knowing an invariant subspace, the eigenvalue problem
for the full matrix $A$ can be reduced to eigenvalue problems for two
smaller matrices. Assume that

$$P = (P_1, P_2)$$

is a unitary matrix such that its first $m$ columns represented by the matrix
$P_1$ form a basis of $T$. Then $P_2^* A P_1 = 0$, since $T$ is invariant with respect
to $A$, and $P_2^* P_1 = 0$. Therefore, the unitary transformation yields

$$P^* A P = \begin{pmatrix} P_1^* A P_1 & P_1^* A P_2 \\ P_2^* A P_1 & P_2^* A P_2 \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ 0 & A_{22} \end{pmatrix};$$

i.e., the eigenvalue problem for $A$ is reduced to two smaller eigenvalue
problems for the $m \times m$ matrix $A_{11}$ and the $(n-m) \times (n-m)$ matrix $A_{22}$.

The successive iterations of Lemma 7.18 yield only approximations $A^\nu S$
to the invariant subspace $T$. However, if

$$Q_\nu = (Q_{1\nu}, Q_{2\nu})$$

denotes a unitary matrix such that its first $m$ columns represented by the
matrix $Q_{1\nu}$ form a basis of $A^\nu S$, then for

$$Q_\nu^* A Q_\nu = \begin{pmatrix} B_{11,\nu} & B_{12,\nu} \\ B_{21,\nu} & B_{22,\nu} \end{pmatrix}$$

we expect that $B_{21,\nu} \to 0$, $\nu \to \infty$. Before we can establish this result we
need to investigate further the iteration of subspaces.

Choose a basis $y_1,\ldots,y_n$ of $\mathbb{C}^n$ and consider the subspaces

$$S_m := \operatorname{span}\{y_1,\ldots,y_m\}, \quad m = 1,\ldots,n-1.$$

For a simultaneous iteration of all the subspaces $A^\nu S_m$ it clearly suffices
to iterate the basis vectors $A^\nu y_1,\ldots,A^\nu y_n$. If the assumptions of Lemma
7.18 are satisfied for each $m = 1,\ldots,n-1$, then

$$A^\nu S_m \to T_m := \operatorname{span}\{x_1,\ldots,x_m\}, \quad \nu \to \infty,$$

for $m = 1,\ldots,n-1$. Hence we expect to be able to construct unitary
matrices $Q_\nu$ such that $Q_\nu^* A Q_\nu \to R$, $\nu \to \infty$, where $R$ is an upper triangular
matrix that is similar to $A$.

For the actual computation two difficulties arise. Firstly, the iterated
vectors have to be scaled in order to avoid numerical overflow or
underflow. Secondly, by Theorem 7.17, as $\nu \to \infty$ each of the $n$ sequences
$(A^\nu y_1),\ldots,(A^\nu y_n)$ will converge to the subspace $\operatorname{span}\{x_1\}$ spanned by the
eigenvector for the eigenvalue $\lambda_1$ with largest modulus. Hence, for large $\nu$
the vectors $A^\nu y_1,\ldots,A^\nu y_n$ will be almost collinear; i.e., the basis elements
$A^\nu y_1,\ldots,A^\nu y_n$ are almost linearly dependent and therefore ill-conditioned
for spanning the iterated subspaces.

Both these difficulties can be remedied by orthonormalizing the basis
after each step. Assume that $q_{1\nu},\ldots,q_{n\nu}$ are orthonormal vectors such
that

$$A^\nu S_m = \operatorname{span}\{q_{1\nu},\ldots,q_{m\nu}\}, \quad m = 1,\ldots,n-1.$$

Then we compute $Aq_{1\nu},\ldots,Aq_{n\nu}$ and orthonormalize these vectors from
left to right to obtain the vectors $q_{1,\nu+1},\ldots,q_{n,\nu+1}$. This procedure preserves the
property

$$\operatorname{span}\{q_{1,\nu+1},\ldots,q_{m,\nu+1}\} = \operatorname{span}\{Aq_{1\nu},\ldots,Aq_{m\nu}\} = A(\operatorname{span}\{q_{1\nu},\ldots,q_{m\nu}\}) = A^{\nu+1} S_m$$

for $m = 1,\ldots,n-1$.


Theorem 7.19 Assume that $A$ is a diagonalizable $n \times n$ matrix with
eigenvalues

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$, and set

$$T_m := \operatorname{span}\{x_1,\ldots,x_m\} \quad\text{and}\quad U_m := \operatorname{span}\{x_{m+1},\ldots,x_n\}$$

for $m = 1,\ldots,n-1$. Let $q_{10},\ldots,q_{n0}$ be an orthonormal basis of $\mathbb{C}^n$ and
let the subspaces

$$S_m := \operatorname{span}\{q_{10},\ldots,q_{m0}\}$$

satisfy

$$S_m \cap U_m = \{0\}, \quad m = 1,\ldots,n-1.$$

Assume that for each $\nu \in \mathbb{N}$ we have constructed an orthonormal system
$q_{1\nu},\ldots,q_{n\nu}$ with the property

$$A^\nu S_m = \operatorname{span}\{q_{1\nu},\ldots,q_{m\nu}\}, \quad m = 1,\ldots,n-1, \tag{7.14}$$

and define $Q_\nu = (q_{1\nu},\ldots,q_{n\nu})$. Then for the sequence of matrices
$A_\nu = (a_{jk,\nu})$ given by

$$A_{\nu+1} = Q_\nu^* A Q_\nu \tag{7.15}$$

we have convergence:

$$\lim_{\nu\to\infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n,$$

and

$$\lim_{\nu\to\infty} a_{jj,\nu} = \lambda_j, \quad j = 1,\ldots,n.$$
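Proof. 1. Without loss of generality we may assume that $\|x_j\|_2 = 1$ for
$j = 1,\ldots,n$. From Lemma 7.18 it follows that

$$\|P_{A^\nu S_m} - P_{T_m}\|_2 \le M r^\nu, \quad m = 1,\ldots,n-1, \quad \nu \in \mathbb{N}, \tag{7.16}$$

for some constant $M$ and

$$r := \max_{m=1,\ldots,n-1} \left|\frac{\lambda_{m+1}}{\lambda_m}\right| < 1.$$

From this, for the projections

$$w_{m\nu} := P_{A^\nu S_m} x_m, \quad m = 1,\ldots,n-1,$$

and $w_{n\nu} := x_n$, we conclude that

$$\|w_{m\nu} - x_m\|_2 \le M r^\nu, \quad m = 1,\ldots,n, \quad \nu \in \mathbb{N}. \tag{7.17}$$

For sufficiently large $\nu$ the vectors $w_{1\nu},\ldots,w_{n\nu}$ are linearly independent,
and we have that

$$\operatorname{span}\{w_{1\nu},\ldots,w_{m\nu}\} = A^\nu S_m, \quad m = 1,\ldots,n-1.$$

To prove this we assume to the contrary that the vectors $w_{1\nu},\ldots,w_{n\nu}$ are
not linearly independent for all sufficiently large $\nu$. Then there exists a
sequence $\nu_\ell$ such that the vectors $w_{1\nu_\ell},\ldots,w_{n\nu_\ell}$ are linearly dependent for
each $\ell \in \mathbb{N}$. Hence there exist complex numbers $\alpha_{1\ell},\ldots,\alpha_{n\ell}$ such that

$$\sum_{k=1}^n \alpha_{k\ell} w_{k\nu_\ell} = 0 \quad\text{and}\quad \sum_{k=1}^n |\alpha_{k\ell}|^2 = 1. \tag{7.18}$$

By the Bolzano–Weierstrass theorem, without loss of generality, we may
assume that

$$\alpha_{k\ell} \to \alpha_k, \quad \ell \to \infty, \quad k = 1,\ldots,n.$$

Passing to the limit $\ell \to \infty$ in (7.18) with the aid of (7.17) now leads to

$$\sum_{k=1}^n \alpha_k x_k = 0 \quad\text{and}\quad \sum_{k=1}^n |\alpha_k|^2 = 1,$$

which contradicts the linear independence of the eigenvectors $x_1,\ldots,x_n$.

2. We orthonormalize by setting $\tilde p_1 := x_1$ and

$$\tilde p_m := x_m - P_{T_{m-1}} x_m, \quad m = 2,\ldots,n,$$

$$p_m := \frac{\tilde p_m}{\|\tilde p_m\|_2}, \quad m = 1,\ldots,n,$$

and, analogously, $\tilde v_{1\nu} := w_{1\nu}$ and

$$\tilde v_{m\nu} := w_{m\nu} - P_{A^\nu S_{m-1}} w_{m\nu}, \quad m = 2,\ldots,n,$$

$$v_{m\nu} := \frac{\tilde v_{m\nu}}{\|\tilde v_{m\nu}\|_2}, \quad m = 1,\ldots,n.$$

Then

$$\operatorname{span}\{p_1,\ldots,p_m\} = T_m, \quad m = 1,\ldots,n-1,$$

and by repeating the above argument,

$$\operatorname{span}\{v_{1\nu},\ldots,v_{m\nu}\} = A^\nu S_m, \quad m = 1,\ldots,n-1, \tag{7.19}$$

for sufficiently large $\nu$. Writing

$$\tilde p_m - \tilde v_{m\nu} = x_m - w_{m\nu} + (P_{A^\nu S_{m-1}} - P_{T_{m-1}}) x_m + P_{A^\nu S_{m-1}} (w_{m\nu} - x_m),$$

with the aid of (7.16) and (7.17) we obtain that

$$\|\tilde v_{m\nu} - \tilde p_m\|_2 \le 3 M r^\nu, \quad m = 1,\ldots,n, \quad \nu \in \mathbb{N}.$$

From this and the representation

$$v_{m\nu} - p_m = \tilde v_{m\nu}\, \frac{\|\tilde p_m\|_2 - \|\tilde v_{m\nu}\|_2}{\|\tilde v_{m\nu}\|_2 \, \|\tilde p_m\|_2} + \frac{\tilde v_{m\nu} - \tilde p_m}{\|\tilde p_m\|_2}$$

it follows that

$$\|v_{m\nu} - p_m\|_2 \le C r^\nu, \quad m = 1,\ldots,n, \quad \nu \in \mathbb{N}, \tag{7.20}$$

for some constant $C$.

3. From (7.14) and (7.19), by induction, we deduce the existence of phase
factors $\gamma_{m\nu} \in \mathbb{C}$ with $|\gamma_{m\nu}| = 1$ such that

$$q_{m\nu} = \gamma_{m\nu} v_{m\nu}, \quad m = 1,\ldots,n.$$

Therefore, defining the diagonal matrices $D_\nu = \operatorname{diag}(\gamma_{1\nu},\ldots,\gamma_{n\nu})$ and the
unitary matrices $V_\nu = (v_{1\nu},\ldots,v_{n\nu})$, we have the relation

$$Q_\nu = V_\nu D_\nu.$$

This implies that

$$A_{\nu+1} = Q_\nu^* A Q_\nu = D_\nu^* V_\nu^* A V_\nu D_\nu = D_\nu^* (V_\nu^* - P^*) A V_\nu D_\nu + D_\nu^* P^* A (V_\nu - P) D_\nu + D_\nu^* P^* A P D_\nu, \tag{7.21}$$

where $P = (p_1,\ldots,p_n)$. Because of (7.20) we have that

$$\|V_\nu - P\|_2 = \|V_\nu^* - P^*\|_2 \to 0, \quad \nu \to \infty.$$

Furthermore, $D_\nu^* P^* A P D_\nu$ is an upper triangular matrix with diagonal
elements $\lambda_1,\ldots,\lambda_n$. Hence, the assertion of the theorem follows by
passing to the limit $\nu \to \infty$ in (7.21). We note that for the elements above
the diagonal we do not, in general, have convergence because of the
occurrence of the phase factors. □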


For the actual numerical implementation we have to describe the
computation of $A_\nu$ according to (7.15). From page 20 we recall that
orthonormalizing $n$ vectors $a_1,\ldots,a_n$ from left to right is equivalent to determining
orthonormal vectors $q_1,\ldots,q_n$ and an upper triangular matrix $R = (r_{jk})$
such that

$$a_k = \sum_{i=1}^k r_{ik} q_i, \quad k = 1,\ldots,n.$$

For the matrices $A = (a_1,\ldots,a_n)$ and $Q = (q_1,\ldots,q_n)$ this corresponds to
a QR decomposition

$$A = QR$$

as described in detail in Section 2.4. Now assume that $A_\nu = Q_{\nu-1}^* A Q_{\nu-1}$
has been determined according to (7.15). To generate $A_{\nu+1}$ from this, a
QR decomposition of the matrix $AQ_{\nu-1}$ is required, since

$$A^\nu S_m = A A^{\nu-1} S_m = \operatorname{span}\{Aq_{1,\nu-1},\ldots,Aq_{m,\nu-1}\}.$$

This is obtained from a QR decomposition

$$A_\nu = \tilde Q_\nu R_\nu \tag{7.22}$$

of $A_\nu$ by

$$AQ_{\nu-1} = Q_{\nu-1} A_\nu = Q_{\nu-1} \tilde Q_\nu R_\nu = Q_\nu R_\nu,$$

where $Q_\nu = Q_{\nu-1} \tilde Q_\nu$. From this we find that

$$A_{\nu+1} = Q_\nu^* A Q_\nu = \tilde Q_\nu^* A_\nu \tilde Q_\nu = R_\nu \tilde Q_\nu. \tag{7.23}$$

Hence the two equations (7.22) and (7.23) represent one step of the
successive iterations of subspaces as described in Theorem 7.19.

Now the QR algorithm consists in performing these iterations starting
from the canonical basis $e_1,\ldots,e_n$, which means that in the first step a
QR decomposition is required for $A_1 = A = (Ae_1,\ldots,Ae_n)$.


Theorem 7.20 (QR algorithm) Let $A$ be a diagonalizable matrix with
eigenvalues

$$|\lambda_1| > |\lambda_2| > \cdots > |\lambda_n|$$

and corresponding eigenvectors $x_1, x_2, \ldots, x_n$, and assume that

$$\operatorname{span}\{e_1,\ldots,e_m\} \cap \operatorname{span}\{x_{m+1},\ldots,x_n\} = \{0\} \tag{7.24}$$

for $m = 1,\ldots,n-1$. Starting with $A_1 = A$, construct a sequence $(A_\nu)$ by
determining a QR decomposition

$$A_\nu = Q_\nu R_\nu$$

and setting

$$A_{\nu+1} = R_\nu Q_\nu$$

for $\nu = 1, 2, \ldots$. Then for $A_\nu = (a_{jk,\nu})$ we have convergence:

$$\lim_{\nu\to\infty} a_{jk,\nu} = 0, \quad 1 \le k < j \le n,$$

and

$$\lim_{\nu\to\infty} a_{jj,\nu} = \lambda_j, \quad j = 1,\ldots,n.$$

Proof. This is just a special case of Theorem 7.19. □
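
The iteration of Theorem 7.20 can be sketched as follows. This is a minimal illustration under stated assumptions: the matrix is real and nonsingular, and the QR decomposition is computed by modified Gram-Schmidt purely to keep the sketch short and self-contained (it is not the Householder-based decomposition one would use in practice). The helper names `matmul`, `qr_gram_schmidt`, and `qr_algorithm` are hypothetical.

```python
import math

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qr_gram_schmidt(A):
    """QR decomposition A = QR of a nonsingular matrix by modified Gram-Schmidt."""
    n = len(A)
    Q = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = [A[i][k] for i in range(n)]              # k-th column of A
        for i in range(k):
            R[i][k] = sum(Q[r][i] * v[r] for r in range(n))
            v = [v[r] - R[i][k] * Q[r][i] for r in range(n)]
        R[k][k] = math.sqrt(sum(x * x for x in v))
        for r in range(n):
            Q[r][k] = v[r] / R[k][k]
    return Q, R

def qr_algorithm(A, steps):
    """Perform `steps` iterations A <- R Q, where A = Q R."""
    A = [row[:] for row in A]
    for _ in range(steps):
        Q, R = qr_gram_schmidt(A)
        A = matmul(R, Q)
    return A

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
A_100 = qr_algorithm(A, 100)
diagonal = [A_100[i][i] for i in range(3)]
```

For this matrix the diagonal approaches $2+\sqrt{2}$, $2$, $2-\sqrt{2}$, ordered by decreasing modulus, while the entries below the diagonal tend to zero.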


We proceed with a discussion of the assumption (7.24). Define the
matrices $X := (x_1,\ldots,x_n)$ and $Y := X^{-1} = (y_{jk})$. Then the identity $I = XY$
means that

$$e_j = \sum_{k=1}^n x_k y_{kj}, \quad j = 1,\ldots,n.$$

For fixed $m = 1,\ldots,n-1$ the property (7.24) holds if and only if

$$\sum_{j=1}^m \alpha_j e_j \in \operatorname{span}\{x_{m+1},\ldots,x_n\}$$

implies that $\alpha_1 = \cdots = \alpha_m = 0$. This in turn is satisfied if and only if the
homogeneous linear system

$$\sum_{j=1}^m y_{kj}\alpha_j = 0, \quad k = 1,\ldots,m,$$

admits only the trivial solution, since

$$\sum_{j=1}^m \alpha_j e_j = \sum_{k=1}^n \Bigl(\sum_{j=1}^m y_{kj}\alpha_j\Bigr) x_k.$$

Hence (7.24) holds if and only if for $m = 1,\ldots,n-1$ the $m \times m$
submatrices $(y_{kj})$, $k,j = 1,\ldots,m$, are nonsingular. This means that for the
matrix $Y$, Gaussian elimination works without interchanging columns; i.e.,
the matrix $Y$ has an LR decomposition. Since Gaussian elimination with
column pivoting always works, there exists a permutation matrix $P$ such
that we have an LR decomposition $PY = LR$ (see Problem 2.16). Hence
it is plausible that the assumption (7.24) is not very restrictive. Indeed,
it can be shown that convergence of the QR algorithm also holds when
(7.24) is not satisfied. However, in general, the eigenvalues on the diagonal
will not occur ordered according to their size (see [65]). Furthermore, it
can be shown that in the case of eigenvalues with the same modulus, the
QR algorithm still works in the sense of an appropriately modified version
of Theorem 7.20. For example, for two conjugate complex eigenvalues, the
upper triangular matrix will be distorted through a two-by-two block on
the diagonal. The blocks do not converge, but still the conjugate complex
eigenvalues can be obtained as eigenvalues of the individual two-by-two
blocks (see [65]).

In principle, the QR decomposition required in each step of the QR 
algorithm can be done through the Gram—Schmidt procedure. However, in 
practice, because of the ill-conditioning of the Gram—Schmidt procedure, 
orthogonalizing by Householder transformations is preferable. For details 
we refer back to Section 2.4. 

The basic form of the QR algorithm as described above is not yet efficient
enough for applications, since each iteration step requires $O(n^3)$ operations.
The speed of convergence is determined by the location of the eigenvalues
with respect to one another. The matrix $A - \sigma I$ has the eigenvalues $\lambda_j - \sigma$
for $j = 1,\ldots,n$. If we choose for $\sigma$ an approximate value of the eigenvalue
$\lambda_n$ of smallest absolute value, then $\lambda_n - \sigma$ becomes small. This will speed
up the convergence in the last row of the matrix, since

$$\frac{|\lambda_n - \sigma|}{|\lambda_{n-1} - \sigma|} \ll 1.$$

Having reduced the elements of the last row to almost zero, the last row
and column of the matrix may be neglected. This means that the smallest
eigenvalue is deflated by canceling the last row and column, and the same
procedure can be applied to the remaining $(n-1) \times (n-1)$ matrix with the
parameter $\sigma$ changed to be close to $\lambda_{n-1}$. This so-called shift and deflation
strategy leads to a tremendous speeding up of the convergence. For details
we refer to [27, 65].
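
The effect of a shift can be observed in a small experiment. The sketch below applies QR steps to $A_\nu - \sigma I$ and adds the shift back afterwards, i.e., $A_{\nu+1} = R_\nu Q_\nu + \sigma I$, and counts the steps until the off-diagonal entries of the last row fall below a tolerance. It is an illustration only: the fixed shift $\sigma = 0.58$ (a rough guess for $\lambda_3 = 2 - \sqrt{2}$ of the test matrix), the tolerance, and all function names are assumptions of this sketch, and no deflation is carried out.

```python
import math

def matmul(X, Y):
    """Product of two square matrices stored as lists of rows."""
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def qr_gram_schmidt(A):
    """QR decomposition A = QR of a nonsingular matrix by modified Gram-Schmidt."""
    n = len(A)
    Q = [[0.0] * n for _ in range(n)]
    R = [[0.0] * n for _ in range(n)]
    for k in range(n):
        v = [A[i][k] for i in range(n)]
        for i in range(k):
            R[i][k] = sum(Q[r][i] * v[r] for r in range(n))
            v = [v[r] - R[i][k] * Q[r][i] for r in range(n)]
        R[k][k] = math.sqrt(sum(x * x for x in v))
        for r in range(n):
            Q[r][k] = v[r] / R[k][k]
    return Q, R

def count_qr_steps(A, sigma, tol=1e-10, max_steps=1000):
    """Count shifted QR steps until the last row is diagonal up to `tol`."""
    n = len(A)
    A = [row[:] for row in A]
    for step in range(max_steps):
        if max(abs(A[n - 1][j]) for j in range(n - 1)) < tol:
            return step, A[n - 1][n - 1]
        shifted = [[A[i][j] - (sigma if i == j else 0.0) for j in range(n)]
                   for i in range(n)]
        Q, R = qr_gram_schmidt(shifted)
        RQ = matmul(R, Q)
        A = [[RQ[i][j] + (sigma if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    return max_steps, A[n - 1][n - 1]

A = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
plain_steps, lam_plain = count_qr_steps(A, 0.0)      # no shift
shift_steps, lam_shift = count_qr_steps(A, 0.58)     # shift near lambda_3
```

Both runs deliver the same eigenvalue in the lower right corner, but the shifted iteration needs noticeably fewer steps, in line with the ratio $|\lambda_n - \sigma| / |\lambda_{n-1} - \sigma|$ being much smaller than $|\lambda_n| / |\lambda_{n-1}|$.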

The computational cost of one step of the QR algorithm is reduced when
the matrix has a large number of zero entries. For example, for tridiagonal 
matrices all matrices generated in the QR algorithm remain tridiagonal. In 
the following section we will consider so-called Hessenberg matrices, which 
differ from upper triangular matrices only by a non-zero first subdiagonal. It 
can be shown (see Problem 7.16) that the Hessenberg form is also invariant 
with respect to the QR algorithm. Hence, for practical computations it is 
convenient first to transform the matrix into Hessenberg form. 

In general, comparing the computational costs, for symmetric matrices 
the QR algorithm is superior to the Jacobi method. However, the actual 
programming for the Jacobi method is very simple as compared with the 
QR algorithm. Hence for small matrix size n the Jacobi method is still 
attractive. 


7.5 Hessenberg Matrices


Definition 7.21 An $n \times n$ matrix $B = (b_{jk})$ is called a Hessenberg matrix
if $b_{jk} = 0$ for $1 \le k \le j - 2$, $j = 3,\ldots,n$; i.e., in the lower triangular
part of a Hessenberg matrix only the elements of the first subdiagonal can
be different from zero.


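We proceed by showing that each matrix $A$ can be transformed into
Hessenberg form by unitary transformations using Householder matrices.
We start with generating zeros in the first column by multiplying $A$ from
the left by a Householder matrix $H_1$. We write

$$A = \begin{pmatrix} a_{11} & * \\ a_1 & \tilde A_{n-1} \end{pmatrix},$$

where $\tilde A_{n-1}$ is an $(n-1) \times (n-1)$ matrix and $a_1$ an $(n-1)$ vector. Then
considering a Householder matrix $H_1$ of the form

$$H_1 = \begin{pmatrix} 1 & 0 \\ 0 & \tilde H_{n-1} \end{pmatrix},$$

where $\tilde H_{n-1} = I - 2v_1v_1^*$ is an $(n-1) \times (n-1)$ Householder matrix, we have

$$H_1 A = \begin{pmatrix} a_{11} & * \\ \tilde H_{n-1} a_1 & \tilde H_{n-1}\tilde A_{n-1} \end{pmatrix}$$

and

$$H_1 A H_1^* = \begin{pmatrix} a_{11} & * \\ \tilde H_{n-1} a_1 & \tilde H_{n-1}\tilde A_{n-1}\tilde H_{n-1}^* \end{pmatrix}.$$

As shown in the proof of Theorem 2.13, choosing

$$v_1 = \frac{u_1}{\sqrt{u_1^* u_1}},$$

where

$$u_1 = a_1 + \sigma (1,0,\ldots,0)^T$$

and

$$\sigma = \begin{cases} \dfrac{a_{21}}{|a_{21}|}\sqrt{a_1^* a_1}, & a_{21} \neq 0, \\[2mm] \sqrt{a_1^* a_1}, & a_{21} = 0, \end{cases}$$

eliminates all elements of $a_1$ with the exception of the first component; i.e.,
$\tilde H_{n-1} a_1$ is a multiple of $(1,0,\ldots,0)^T$.
Hence the first column of the transformed matrix is of the required form.
Now assume that $A_k$ is an $n \times n$ matrix of the form

$$A_k = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & \tilde A_{n-k} \end{pmatrix},$$

where $B_k$ is a $k \times k$ Hessenberg matrix, $\tilde A_{n-k}$ an $(n-k) \times (n-k)$ matrix,
$a_k$ an $(n-k)$ vector, and $0$ the $(n-k) \times (k-1)$ zero matrix. Then for a
Householder transformation of the form

$$H_k = \begin{pmatrix} I_k & 0 \\ 0 & \tilde H_{n-k} \end{pmatrix},$$

where $I_k$ denotes the $k \times k$ identity matrix and $\tilde H_{n-k}$ is an $(n-k) \times (n-k)$
Householder matrix, it follows that

$$A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; a_k & \tilde A_{n-k}\tilde H_{n-k}^* \end{pmatrix}$$

and

$$H_k A_k H_k^* = \begin{pmatrix} B_k & * \\ 0 \;\; \tilde H_{n-k} a_k & \tilde H_{n-k}\tilde A_{n-k}\tilde H_{n-k}^* \end{pmatrix}.$$

Now, proceeding as above, we can choose $\tilde H_{n-k}$ such that all elements of
$\tilde H_{n-k} a_k$ vanish with the exception of the first component. This procedure
reduces a further column into Hessenberg form. We can summarize our
analysis in the following theorem.

Theorem 7.22 To each $n \times n$ matrix $A$ there exist $n - 2$ Householder
matrices $H_1,\ldots,H_{n-2}$ such that for $Q = H_{n-2} \cdots H_1$ the matrix

$$B = Q^* A Q$$

is a Hessenberg matrix.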
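
The column-by-column reduction can be sketched as follows. This is a minimal illustration for real matrices only; the function name `hessenberg` and the storage as lists of rows are assumptions of this sketch. In step $k$ the Householder vector is built from the part of column $k$ below the diagonal, with the sign of $\sigma$ chosen as in the text, and the reflection $H = I - 2vv^T$ is applied from the left and from the right.

```python
import math

def hessenberg(A):
    """Reduce a real square matrix to Hessenberg form by Householder similarities."""
    n = len(A)
    B = [row[:] for row in A]
    for k in range(n - 2):
        a = [B[i][k] for i in range(k + 1, n)]        # column k below the diagonal
        norm_a = math.sqrt(sum(x * x for x in a))
        if norm_a == 0.0:
            continue                                   # nothing to eliminate
        sigma = norm_a if a[0] >= 0 else -norm_a       # avoids cancellation in u[0]
        u = a[:]
        u[0] += sigma
        norm_u = math.sqrt(sum(x * x for x in u))
        v = [0.0] * (k + 1) + [x / norm_u for x in u]  # v embedded into R^n
        # B <- H B with H = I - 2 v v^T (acts on the columns of B).
        for j in range(n):
            dot = sum(v[i] * B[i][j] for i in range(n))
            for i in range(n):
                B[i][j] -= 2.0 * v[i] * dot
        # B <- B H (acts on the rows of B).
        for i in range(n):
            dot = sum(B[i][j] * v[j] for j in range(n))
            for j in range(n):
                B[i][j] -= 2.0 * dot * v[j]
    return B

A = [[4.0, 1.0, 2.0, 3.0],
     [2.0, 3.0, 0.0, 1.0],
     [1.0, 5.0, 2.0, 1.0],
     [3.0, 1.0, 1.0, 6.0]]
B = hessenberg(A)
```

Since each step is an orthogonal similarity transformation, quantities such as the trace and the Frobenius norm of the matrix are preserved, which gives a cheap consistency check.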


For a Hessenberg matrix the value of the characteristic polynomial and
its derivative at a point $\lambda \in \mathbb{C}$ can be computed easily without computing
the coefficients of the polynomial. These two quantities are required for
employing Newton's method for approximating the eigenvalues as the zeros
of the characteristic polynomial. We first consider the case of a symmetric
Hessenberg matrix.


Example 7.23 Let

$$A = \begin{pmatrix} a_1 & c_2 & & & \\ c_2 & a_2 & c_3 & & \\ & \ddots & \ddots & \ddots & \\ & & c_{n-1} & a_{n-1} & c_n \\ & & & c_n & a_n \end{pmatrix}$$

be a symmetric tridiagonal matrix. Denote by $A_k$ the $k \times k$ submatrix
consisting of the first $k$ rows and columns of $A$, and let $p_k$ denote the
characteristic polynomial of $A_k$. Then we have the recurrence relations

$$p_k(\lambda) = (a_k - \lambda)\,p_{k-1}(\lambda) - c_k^2\, p_{k-2}(\lambda), \quad k = 2,\ldots,n, \tag{7.25}$$

and

$$p_k'(\lambda) = (a_k - \lambda)\,p_{k-1}'(\lambda) - c_k^2\, p_{k-2}'(\lambda) - p_{k-1}(\lambda), \quad k = 2,\ldots,n, \tag{7.26}$$

starting with $p_0(\lambda) = 1$ and $p_1(\lambda) = a_1 - \lambda$.


Proof. The recursion (7.25) follows by expanding $\det(A_k - \lambda I)$ with respect
to the last column, and (7.26) is obtained by differentiating (7.25). □


Example 7.24 The $n \times n$ tridiagonal matrix

$$A = \begin{pmatrix} 2 & -1 & & & \\ -1 & 2 & -1 & & \\ & \ddots & \ddots & \ddots & \\ & & -1 & 2 & -1 \\ & & & -1 & 2 \end{pmatrix}$$

has the eigenvalues

$$\lambda_j = 4\sin^2 \frac{j\pi}{2(n+1)}, \quad j = 1,\ldots,n$$

(see Example 4.17). Table 7.1 gives the results of the Newton iteration using
(7.25) and (7.26) for computing the smallest eigenvalue $\lambda_{\min} = \lambda_1$ and the
largest eigenvalue $\lambda_{\max} = \lambda_n$ for $n = 10$. The starting values are obtained
from the Gerschgorin estimates $|\lambda - 2| \le 2$ following from Theorem 7.7. □


TABLE 7.1. Hessenberg method for Example 7.24

$\lambda_{\max}$ | $\lambda_{\min}$
4.00000000 | 0.00000000
3.95000000 | 0.05000000
3.92542110 | 0.07457890
3.91933549 | 0.08066451
3.91898705 | 0.08101295
3.91898595 | 0.08101405
3.91898595 | 0.08101405
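
The Newton iteration behind Table 7.1 can be sketched with the recurrences (7.25) and (7.26). The function name `newton_tridiagonal` and the array layout (`c[0]` unused so that `c[k]` holds $c_{k+1}$) are assumptions of this sketch.

```python
import math

def newton_tridiagonal(a, c, lam, steps):
    """Newton iterates for a zero of p_n, using the recurrences (7.25) and (7.26).

    a[0..n-1] holds the diagonal a_1..a_n; c[k] holds the off-diagonal
    entry c_{k+1} (c[0] is unused).
    """
    iterates = [lam]
    for _ in range(steps):
        p_prev, p = 1.0, a[0] - lam      # p_0 and p_1
        d_prev, d = 0.0, -1.0            # p_0' and p_1'
        for k in range(1, len(a)):
            p_next = (a[k] - lam) * p - c[k] ** 2 * p_prev
            d_next = (a[k] - lam) * d - c[k] ** 2 * d_prev - p
            p_prev, p, d_prev, d = p, p_next, d, d_next
        lam -= p / d                     # Newton step
        iterates.append(lam)
    return iterates

n = 10
a = [2.0] * n
c = [0.0] + [-1.0] * (n - 1)
largest = newton_tridiagonal(a, c, 4.0, 6)    # start at the upper Gerschgorin bound
smallest = newton_tridiagonal(a, c, 0.0, 6)   # start at the lower Gerschgorin bound
```

Starting from $\lambda = 4$, one finds $p_{10}(4) = 11$ and $p_{10}'(4) = 220$, so the first Newton step gives $4 - 11/220 = 3.95$, in agreement with the second row of the table.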


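We conclude this section by describing the computation of the quotient
of the value of the characteristic polynomial $p(\lambda) = \det(B - \lambda I)$ and its
derivative for a general Hessenberg matrix $B = (b_{jk})$. We assume that
$b_{j,j-1} \neq 0$ for $j = 2,\ldots,n$; i.e., $B$ is irreducible (see Problem 7.15). For a
given $\lambda$ we determine

$$\xi = \xi(\lambda) = (\xi_1,\ldots,\xi_n)^T$$

and $\alpha = \alpha(\lambda)$ such that

$$\begin{aligned} (b_{11} - \lambda)\xi_1 + b_{12}\xi_2 + \cdots + b_{1n}\xi_n &= \alpha, \\ b_{21}\xi_1 + (b_{22} - \lambda)\xi_2 + \cdots + b_{2n}\xi_n &= 0, \\ &\;\;\vdots \\ b_{n,n-1}\xi_{n-1} + (b_{nn} - \lambda)\xi_n &= 0, \end{aligned}$$

and $\xi_n = 1$. This is an $n \times n$ upper triangular linear system for the $n$
unknowns $\alpha, \xi_1,\ldots,\xi_{n-1}$, and it can be solved by backward substitution.
Setting

$$C = \begin{pmatrix} b_{11} - \lambda & b_{12} & \cdots & b_{1,n-1} & \alpha \\ b_{21} & b_{22} - \lambda & \cdots & b_{2,n-1} & 0 \\ & \ddots & \ddots & \vdots & \vdots \\ & & & b_{n,n-1} & 0 \end{pmatrix},$$

by Cramer's rule we have that

$$1 = \xi_n = \frac{\det C}{\det(B - \lambda I)} = \frac{(-1)^{n-1} b_{21} \cdots b_{n,n-1}\,\alpha}{\det(B - \lambda I)};$$

that is,

$$p(\lambda) = (-1)^{n-1} b_{21} \cdots b_{n,n-1}\,\alpha(\lambda).$$

Differentiating the last equation yields

$$p'(\lambda) = (-1)^{n-1} b_{21} \cdots b_{n,n-1}\,\alpha'(\lambda),$$

and therefore

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha(\lambda)}{\alpha'(\lambda)}.$$

By differentiating the above linear system with respect to $\lambda$ we obtain
the linear system

$$\begin{aligned} (b_{11} - \lambda)\eta_1 + b_{12}\eta_2 + \cdots + b_{1,n-1}\eta_{n-1} &= \xi_1 + \beta, \\ b_{21}\eta_1 + (b_{22} - \lambda)\eta_2 + \cdots + b_{2,n-1}\eta_{n-1} &= \xi_2, \\ &\;\;\vdots \\ b_{n,n-1}\eta_{n-1} &= \xi_n \end{aligned}$$

for the derivatives $\beta = \alpha'$, $\eta_1 = \xi_1', \ldots, \eta_{n-1} = \xi_{n-1}'$. This linear
system again can be solved by backward substitution for the $n$ unknowns
$\beta, \eta_1,\ldots,\eta_{n-1}$. Thus we have proven the following theorem.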

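Theorem 7.25 Let $B = (b_{jk})$ be an irreducible Hessenberg matrix and let
$\lambda \in \mathbb{C}$. Starting from $\xi_n = 1$, $\eta_n = 0$, compute recursively

$$\xi_{n-k} = \frac{1}{b_{n-k+1,n-k}} \Bigl( \lambda\xi_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\,\xi_j \Bigr),$$

$$\eta_{n-k} = \frac{1}{b_{n-k+1,n-k}} \Bigl( \xi_{n-k+1} + \lambda\eta_{n-k+1} - \sum_{j=n-k+1}^{n} b_{n-k+1,j}\,\eta_j \Bigr)$$

for $k = 1,\ldots,n-1$ and

$$\alpha = -\lambda\xi_1 + \sum_{j=1}^{n} b_{1j}\xi_j,$$

$$\beta = -\xi_1 - \lambda\eta_1 + \sum_{j=1}^{n} b_{1j}\eta_j.$$

Then for the characteristic polynomial of $B$ we have

$$\frac{p(\lambda)}{p'(\lambda)} = \frac{\alpha}{\beta}.$$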
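
The recursion of Theorem 7.25 translates almost literally into code. The following sketch uses 0-based indices, so row `i` of the list `B` corresponds to row $i+1$ of the matrix; the function name `newton_quotient` is an assumption of this sketch, and the subdiagonal entries are assumed to be nonzero.

```python
def newton_quotient(B, lam):
    """Return p(lam) / p'(lam) for an irreducible Hessenberg matrix B."""
    n = len(B)
    xi = [0.0] * n
    eta = [0.0] * n
    xi[n - 1] = 1.0                      # xi_n = 1, eta_n = 0
    for i in range(n - 1, 0, -1):        # row i determines xi[i-1] and eta[i-1]
        s_xi = sum(B[i][j] * xi[j] for j in range(i, n))
        s_eta = sum(B[i][j] * eta[j] for j in range(i, n))
        xi[i - 1] = (lam * xi[i] - s_xi) / B[i][i - 1]
        eta[i - 1] = (xi[i] + lam * eta[i] - s_eta) / B[i][i - 1]
    alpha = -lam * xi[0] + sum(B[0][j] * xi[j] for j in range(n))
    beta = -xi[0] - lam * eta[0] + sum(B[0][j] * eta[j] for j in range(n))
    return alpha / beta

# 2 x 2 check: p(lam) = lam^2 - 5 lam - 2, hence p(0) / p'(0) = (-2) / (-5).
q_small = newton_quotient([[1.0, 2.0], [3.0, 4.0]], 0.0)

# The tridiagonal matrix of Example 7.24 with n = 10, evaluated at lam = 4.
n = 10
T = [[2.0 if i == j else (-1.0 if abs(i - j) == 1 else 0.0) for j in range(n)]
     for i in range(n)]
q_tridiag = newton_quotient(T, 4.0)
```

A Newton step for an eigenvalue is then `lam - newton_quotient(B, lam)`; for the tridiagonal matrix at $\lambda = 4$ the quotient equals $11/220 = 0.05$, consistent with Table 7.1.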
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Problems 


7.1 For the eigenvalues (repeated according to their algebraic multiplicity) of
an $n \times n$ matrix $A$ show that

$$\operatorname{tr} A = \sum_{j=1}^n \lambda_j \quad\text{and}\quad \det A = \prod_{j=1}^n \lambda_j.$$


7.2 For Example 7.1 show that in the case p = 0 the eigenvalues of the matrix
$A$ converge to the eigenvalues of the differential operator $D$ as $n \to \infty$.


7.3 Show that the eigenvalues of the adjoint matrix $A^*$ are the complex
conjugates of the eigenvalues of the matrix $A$.


7.4 Show that for the eigenvalues of Hermitian matrices the geometric and the
algebraic multiplicities coincide.


7.5 Use Gerschgorin's Theorem 7.7 to determine the approximate location of
the eigenvalues of the matrix

$$A = \begin{pmatrix} 1 & -1 & 0 \\ 1 & 9 & 1 \\ -2 & -1 & 9 \end{pmatrix}.$$

To check the estimates, compute the eigenvalues by finding the zeros of the
characteristic polynomial.


7.6 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1,\ldots,\lambda_n$, $B$ an
$n \times n$ matrix, and $\lambda$ an eigenvalue of $A + B$. Show that

$$\min_{j=1,\ldots,n} |\lambda - \lambda_j| \le \|C\|_p \, \|C^{-1}\|_p \, \|B\|_p,$$

where $C$ is a nonsingular matrix such that $C^{-1}AC$ is diagonal and $p = 1, 2, \infty$.


7.7 Show that the Frobenius norm is indeed a norm on the linear space of 
matrices. 


7.8 Write a computer program for the Jacobi method and test it for various 
examples. 


7.9 Assume that $A$ is a real symmetric $n \times n$ matrix with eigenvalue $\lambda$ of
multiplicity $n - 1$ and a further eigenvalue $\mu \neq \lambda$. Show that $A = \lambda I + (\mu - \lambda)xx^T$,
where $x^T x = 1$, and that by at most $n - 1$ Jacobi transformations $A$ becomes
diagonal.


7.10 Show convergence of the cyclic Jacobi method with threshold [N(A)]?/(2n7). 


7.11 Let $A$ be a diagonalizable $n \times n$ matrix with eigenvalues $\lambda_1,\ldots,\lambda_n$ and
eigenvectors $x_1,\ldots,x_n$, and assume that $|\lambda_1| > |\lambda_2| \ge |\lambda_3| \ge \cdots \ge |\lambda_n|$. Starting
from $v_0 \in \mathbb{C}^n$ with $v_0 \notin \operatorname{span}\{x_2,\ldots,x_n\}$ show that the sequence

$$v_{\nu+1} := \frac{Av_\nu}{\|Av_\nu\|_2}, \quad \nu = 0,1,2,\ldots,$$

is well-defined and that the sequence of Rayleigh quotients

$$R_\nu := \frac{(Av_\nu, v_\nu)}{\|v_\nu\|_2^2}, \quad \nu = 0,1,2,\ldots,$$

satisfies the estimate

$$|R_\nu - \lambda_1| \le C r^\nu, \quad \nu = 0,1,2,\ldots,$$

for some constant $C > 0$ and $r := |\lambda_2/\lambda_1|$.


7.12 The matrix

        1     2     1
A  =    15    1     1.
        2     0.5   0.5

has eigenvalue $\lambda = 4$ with eigenvector $x = (1,1,1)^T$. Construct a Householder
matrix $H$ such that

$$HAH^* = \begin{pmatrix} 4 & * & * \\ 0 & * & * \\ 0 & * & * \end{pmatrix}$$

and determine the remaining eigenvalues.

7.13 Write a computer program for the QR algorithm and test it for various 
examples. 


7.14 Verify the numerical results of Table 5.2 for the Hilbert matrix. 


7.15 Show that Hessenberg matrices $B = (b_{jk})$ with $b_{j,j-1} \neq 0$ for $j = 2,\ldots,n$
are irreducible.


7.16 Show that the Hessenberg form of a matrix is preserved by the QR
algorithm.


7.17 Show that the number of multiplications required for the transformation
of a matrix into Hessenberg form via Householder transformations according to
Theorem 7.22 is $5n^3/3 + O(n^2)$.


7.18 Write a computer program for transforming a matrix into Hessenberg form
via Householder transformations according to Theorem 7.22.


7.19 Discuss Newton's method for the solution of $Ax = \lambda x$, $x^T x = 1$ in the
neighborhood of a simple eigenvalue of a real symmetric matrix $A$.


7.20 Prove the inequality

$$\sum_{j=1}^n |\lambda_j|^2 \le \Bigl( \|A\|_F^4 - \frac{1}{2}\,\|AA^* - A^*A\|_F^2 \Bigr)^{1/2}$$

for the eigenvalues of an $n \times n$ matrix $A$ (see Corollary 7.9 and [41]).


8 


Interpolation 


Polynomials have attracted the attention of mathematicians for centuries 
because of their many beautiful properties. For numerical purposes they 
have the advantage that their computation reduces to additions and mul- 
tiplications only. Therefore, it is quite natural to use polynomials for the 
approximation of more complicated functions. A classical approach to spec- 
ifying the coefficients of a polynomial of degree n is to prescribe that its 
values at n+ 1 distinct points coincide with those of the function to be 
approximated. The development and investigation of such interpolation 
polynomials has a long mathematical history, beginning with the use of 
the method of interpolation to tabulate the logarithms, as proposed by 
Briggs in the early seventeenth century. 

It is the purpose of the first section, Section 8.1, of this chapter to intro- 
duce the classical theory of polynomial interpolation, including discussions 
on the effective numerical computation of interpolation polynomials and an 
analysis of the resulting approximation error. The next section, Section 8.2, 
describes the corresponding theory for the interpolation of periodic func- 
tions by trigonometric polynomials. For a detailed study of the foundations 
of classical interpolation theory we refer to [16]. 

In the last two sections, Sections 8.3 and 8.4, we proceed with a study 
of interpolation by splines, i.e., piecewise polynomial interpolation, which 
was developed within the last fifty years and has turned into a successful 
tool in approximation theory and other parts of numerical analysis. For a 
comprehensive study of spline functions we refer to [18, 53], and for their 
use in computer-aided geometric design we refer to [23].

We would like to point out that interpolation is not only important as
a tool for the approximation of functions that are difficult to compute or 
whose values are known only at discrete points. It also serves as an essential 
ingredient for developing numerical integration rules and methods for the 
approximate solution of differential and integral equations, as we shall see 
in the following chapters. 


8.1 Polynomial Interpolation 


For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n$ the linear space of polynomials

$$p(x) = \sum_{k=0}^n a_k x^k$$

for a real (or complex) variable $x$ and with real (or complex) coefficients
$a_0,\ldots,a_n$. A polynomial $p \in P_n$ is said to be of degree $n$ if $a_n \neq 0$. In
this chapter, we consider $P_n$ as a subspace of the linear space $C[a,b]$ of
continuous real- (or complex-) valued functions on the interval $[a,b]$, where
$a < b$. For $m \in \mathbb{N}$ we denote by $C^m[a,b]$ the linear space of $m$ times
continuously differentiable real- (or complex-) valued functions on $[a,b]$.
We recall the following basic uniqueness property of algebraic polynomials
as part of the fundamental theorem of algebra. Since we will use this
property frequently, it is appropriate to include a simple proof by induction.


Theorem 8.1 For $n \in \mathbb{N} \cup \{0\}$, each polynomial in $P_n$ that has more than
$n$ (complex) zeros, where each zero is counted repeatedly according to its
multiplicity, must vanish identically; i.e., all its coefficients must be equal
to zero.


Proof. Obviously, the statement is true for $n = 0$. Assume that it has been
proven for some $n \ge 0$. By using the binomial formula for $x^k = [(x - z) + z]^k$
we can rewrite the polynomial $p \in P_{n+1}$ in the form

$$p(x) = \sum_{k=1}^{n+1} b_k (x - z)^k + b_0$$

with the coefficients $b_0, b_1,\ldots,b_{n+1}$ depending on $a_0, a_1,\ldots,a_{n+1}$ and $z$.
If $z$ is a zero of $p$, then we must have $b_0 = 0$, and this implies that
$p(x) = (x - z)q(x)$ with $q \in P_n$. Obviously, $q$ has more than $n$ zeros,
since $p$ has more than $n + 1$ zeros. Hence, by the induction assumption, $q$
must vanish identically, and this implies that $p$ vanishes identically. □


Theorem 8.2 The monomials $u_k(x) := x^k$, $k = 0,\ldots,n$, are linearly
independent.

Proof. In order to prove this, assume that

$$\sum_{k=0}^n a_k u_k = 0,$$

that is,

$$\sum_{k=0}^n a_k x^k = 0, \quad x \in [a,b].$$

Then the polynomial with coefficients $a_0, a_1,\ldots,a_n$ has more than $n$
distinct zeros, and from Theorem 8.1 it follows that all the coefficients must
be zero. □


The linear independence of the monomials $u_0,\ldots,u_n$ implies that they
form a basis for $P_n$ and that $P_n$ has dimension $n + 1$.


Theorem 8.3 Given $n + 1$ distinct points $x_0,\ldots,x_n \in [a,b]$ and $n + 1$
values $y_0,\ldots,y_n \in \mathbb{R}$, there exists a unique polynomial $p_n \in P_n$ with the
property

$$p_n(x_j) = y_j, \quad j = 0,\ldots,n. \tag{8.1}$$

In the Lagrange representation, this interpolation polynomial is given by

$$p_n = \sum_{k=0}^n y_k \ell_k \tag{8.2}$$

with the Lagrange factors

$$\ell_k(x) = \prod_{\substack{i=0 \\ i \neq k}}^{n} \frac{x - x_i}{x_k - x_i}, \quad k = 0,\ldots,n.$$

Proof. We note that $\ell_k \in P_n$ for $k = 0,\ldots,n$ and that the equations

$$\ell_k(x_j) = \delta_{jk}, \quad j,k = 0,\ldots,n, \tag{8.3}$$

hold, where $\delta_{jk} = 1$ for $k = j$, and $\delta_{jk} = 0$ for $k \neq j$. It follows that $p_n$
given by (8.2) is in $P_n$, and it fulfills the required interpolation conditions
$p_n(x_j) = y_j$, $j = 0,\ldots,n$.

To prove uniqueness of the interpolation polynomial we assume that
$p_{n,1}, p_{n,2} \in P_n$ are two polynomials satisfying (8.1). Then the difference
$p_n := p_{n,1} - p_{n,2}$ satisfies $p_n(x_j) = 0$, $j = 0,\ldots,n$; i.e., the polynomial
$p_n \in P_n$ has $n + 1$ zeros and therefore by Theorem 8.1 must be identically
zero. This implies that $p_{n,1} = p_{n,2}$. □


The representation (8.2), which was discovered by Lagrange in 1794, is
very convenient for theoretical investigations because of its simple structure.
However, for practical computations it is suitable only for small $n$. For
large $n$ the Lagrange factors become very large and highly oscillatory, which
causes ill-conditioning of the Lagrange interpolation polynomial. Already 
in 1676, in his study of quadrature formulae (see Theorem 9.3), Newton 
had obtained a representation of the interpolation polynomial that is more 
practical for computational purposes. For its description we need to give 
the following definition. 


Definition 8.4 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$
values $y_0, \ldots, y_n \in \mathbb{R}$, the divided differences $D_j^k$ of order $k$ at the point
$x_j$ are recursively defined by
\[ D_j^0 := y_j, \quad j = 0, \ldots, n, \]
\[ D_j^k := \frac{D_{j+1}^{k-1} - D_j^{k-1}}{x_{j+k} - x_j}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n. \]

We notice that the points $x_0, \ldots, x_n$ need not be in ascending order. It
is convenient to arrange the divided differences according to the tableau


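\[
\begin{array}{lllll}
x_0 & y_0 = D_0^0 &       &       &       \\
    &             & D_0^1 &       &       \\
x_1 & y_1 = D_1^0 &       & D_0^2 &       \\
    &             & D_1^1 &       & D_0^3 \\
x_2 & y_2 = D_2^0 &       & D_1^2 &       \\
    &             & D_2^1 &       &       \\
x_3 & y_3 = D_3^0 &       &       &
\end{array}
\]

which we illustrate by the following example. Obviously, for the full tableau
the computational cost is of order $O(n^2)$.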


Example 8.5 For the points $x_0 = 0$, $x_1 = 1$, $x_2 = 3$, $x_3 = 4$ and the
values $y_0 = 0$, $y_1 = 2$, $y_2 = 8$, $y_3 = 9$ the tableau of the divided differences
is given by
\[
\begin{array}{rrrrr}
0 & 0 &   &      &      \\
  &   & 2 &      &      \\
1 & 2 &   & 1/3  &      \\
  &   & 3 &      & -1/4 \\
3 & 8 &   & -2/3 &      \\
  &   & 1 &      &      \\
4 & 9 &   &      &
\end{array}
\]
Each value $D_j^k$ in the $k$th column is obtained by taking the difference of
the two neighboring values $D_j^{k-1}$ and $D_{j+1}^{k-1}$ in the preceding column and
dividing it by the difference $x_{j+k} - x_j$ of the points $x_{j+k}$ and $x_j$. □
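
The recursion of Definition 8.4 translates almost literally into code. The following is a minimal sketch, not part of the original text: the function name `divided_differences` and the column-wise storage are illustrative choices, and exact rational arithmetic is used only so that the result can be compared with the tableau of Example 8.5.

```python
from fractions import Fraction

def divided_differences(xs, ys):
    """Return the tableau as a list of columns: cols[k][j] = D_j^k."""
    n = len(xs) - 1
    cols = [[Fraction(y) for y in ys]]          # column k = 0: D_j^0 = y_j
    for k in range(1, n + 1):
        prev = cols[-1]
        # D_j^k = (D_{j+1}^{k-1} - D_j^{k-1}) / (x_{j+k} - x_j), j = 0, ..., n - k
        cols.append([(prev[j + 1] - prev[j]) / (xs[j + k] - xs[j])
                     for j in range(n - k + 1)])
    return cols

# Data of Example 8.5
cols = divided_differences([0, 1, 3, 4], [0, 2, 8, 9])
for k, col in enumerate(cols):
    print("order", k, [str(d) for d in col])
```

Each column is built from the preceding one, so the full tableau costs on the order of n^2 operations, in agreement with the remark before the example.
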
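
Lemma 8.6 The divided differences satisfy the relation
\[ D_j^k = \sum_{m=j}^{j+k} y_m \prod_{i=j,\, i \neq m}^{j+k} \frac{1}{x_m - x_i}, \quad j = 0, \ldots, n - k, \quad k = 1, \ldots, n. \tag{8.4} \]

Proof. We proceed by induction with respect to the order $k$. Trivially, (8.4)
holds for $k = 1$. We assume that (8.4) has been proven for order $k - 1$ for
some $k \geq 2$. Then, using Definition 8.4, the induction assumption, and the
identity
\[ \frac{1}{x_{j+k} - x_j} \left[ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right] = \frac{1}{(x_m - x_{j+k})(x_m - x_j)}, \]
we obtain
\[
\begin{aligned}
D_j^k &= \frac{1}{x_{j+k} - x_j} \left[ \sum_{m=j+1}^{j+k} y_m \prod_{i=j+1,\, i \neq m}^{j+k} \frac{1}{x_m - x_i} - \sum_{m=j}^{j+k-1} y_m \prod_{i=j,\, i \neq m}^{j+k-1} \frac{1}{x_m - x_i} \right] \\
&= \frac{1}{x_{j+k} - x_j} \sum_{m=j+1}^{j+k-1} y_m \left[ \frac{1}{x_m - x_{j+k}} - \frac{1}{x_m - x_j} \right] \prod_{i=j+1,\, i \neq m}^{j+k-1} \frac{1}{x_m - x_i} \\
&\quad + y_{j+k} \prod_{i=j}^{j+k-1} \frac{1}{x_{j+k} - x_i} + y_j \prod_{i=j+1}^{j+k} \frac{1}{x_j - x_i}
 = \sum_{m=j}^{j+k} y_m \prod_{i=j,\, i \neq m}^{j+k} \frac{1}{x_m - x_i};
\end{aligned}
\]
i.e., (8.4) also holds for order $k$. □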


Theorem 8.7 In the Newton representation, for $n \geq 1$ the uniquely
determined interpolation polynomial $p_n$ of Theorem 8.3 is given by
\[ p_n(x) = y_0 + \sum_{k=1}^{n} D_0^k \prod_{i=0}^{k-1} (x - x_i). \tag{8.5} \]


Proof. We denote the right-hand side of (8.5) by $\tilde{p}_n$ and establish $p_n = \tilde{p}_n$
by induction with respect to the degree $n$. For $n = 1$ the representation
(8.5) is correct. We assume that (8.5) has been proven for degree $n - 1$ for
some $n \geq 2$ and consider the difference $d_n := p_n - \tilde{p}_n$. Since
\[ d_n(x) = p_n(x) - \tilde{p}_{n-1}(x) - D_0^n \prod_{i=0}^{n-1} (x - x_i), \]
as a consequence of Theorem 8.3 and Lemma 8.6 the coefficient of $x^n$ in the
polynomial $d_n$ vanishes; i.e., $d_n \in P_{n-1}$. Using the induction assumption,
we have that
\[ \tilde{p}_{n-1}(x_j) = y_j = p_n(x_j), \quad j = 0, \ldots, n - 1, \]
and therefore
\[ d_n(x_j) = 0, \quad j = 0, \ldots, n - 1. \]
Hence, by Theorem 8.1 it follows that $d_n = 0$, and therefore $p_n = \tilde{p}_n$. □


Example 8.8 The interpolation polynomial corresponding to Example 8.5
is given by
\[ p_3(x) = 2x + \frac{1}{3}\, x(x - 1) - \frac{1}{4}\, x(x - 1)(x - 3). \]
□


Analogously to the Horner scheme (see (6.10)), the value of the Newton
interpolation polynomial at a point $x$ can be obtained by nested
multiplications according to
\[
\begin{aligned}
p_n(x) &= a_n (x - x_0)(x - x_1) \cdots (x - x_{n-1}) + \cdots + a_1 (x - x_0) + a_0 \\
       &= (\cdots (a_n (x - x_{n-1}) + a_{n-1})(x - x_{n-2}) + \cdots + a_1)(x - x_0) + a_0
\end{aligned}
\]
by $O(n)$ multiplications and additions; here $a_0 = y_0$ and $a_k = D_0^k$,
$k = 1, \ldots, n$, are the coefficients from (8.5). For an evaluation of the
interpolation polynomial at a single point $x$ without explicitly computing the
coefficients of the polynomial, the following Neville scheme is very practical.
From the formal coincidence of the recursion (8.6) and Definition 8.4 for
the divided differences, it is obvious that the computations for (8.6) can be
arranged in a tableau analogous to the tableau for the divided differences.
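
The nested multiplication for the Newton form can be sketched as follows. This is an illustrative sketch rather than code from the text: the helper names `newton_coefficients` and `newton_eval` are assumptions, and the coefficients are taken to be the top diagonal of the divided-difference tableau, as in (8.5).

```python
from fractions import Fraction

def newton_coefficients(xs, ys):
    """Return [D_0^0, D_0^1, ..., D_0^n], the top diagonal of the tableau."""
    d = [Fraction(y) for y in ys]            # column k = 0
    coeffs = [d[0]]
    for k in range(1, len(xs)):
        # build column k from column k - 1; afterwards d[j] holds D_j^k
        d = [(d[j + 1] - d[j]) / (xs[j + k] - xs[j]) for j in range(len(d) - 1)]
        coeffs.append(d[0])
    return coeffs

def newton_eval(xs, coeffs, x):
    """Nested multiplication: O(n) multiplications and additions."""
    value = coeffs[-1]                        # start with a_n
    for k in range(len(coeffs) - 2, -1, -1):  # k = n - 1, ..., 0
        value = value * (x - xs[k]) + coeffs[k]
    return value

xs, ys = [0, 1, 3, 4], [0, 2, 8, 9]           # data of Example 8.5
a = newton_coefficients(xs, ys)
print([str(c) for c in a], newton_eval(xs, a, Fraction(2)))
```

Once the coefficients are known, each further evaluation needs only one pass through the loop, which is what makes the Newton form attractive when the polynomial is evaluated at many points.
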


Theorem 8.9 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $n + 1$
values $y_0, \ldots, y_n \in \mathbb{R}$, the uniquely determined interpolation polynomials
$p_i^k \in P_k$, $i = 0, \ldots, n - k$, $k = 0, \ldots, n$, with the interpolation property
\[ p_i^k(x_j) = y_j, \quad j = i, \ldots, i + k, \]
satisfy the recursive relation
\[ p_i^k(x) = \frac{(x - x_i)\, p_{i+1}^{k-1}(x) - (x - x_{i+k})\, p_i^{k-1}(x)}{x_{i+k} - x_i}. \tag{8.6} \]


Proof. We again proceed by induction with respect to the degree $k$.
Obviously, the statement is true for $k = 1$. Assume that the assertion has been
proven for degree $k - 1$ for some $k \geq 2$. Then the right-hand side of (8.6)
describes a polynomial $p \in P_k$, and by the induction assumption we find
that the interpolation conditions
\[ p(x_j) = \frac{(x_j - x_i)\, y_j - (x_j - x_{i+k})\, y_j}{x_{i+k} - x_i} = y_j, \quad j = i + 1, \ldots, i + k - 1, \]
as well as $p(x_i) = y_i$ and $p(x_{i+k}) = y_{i+k}$ are fulfilled. □
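
A compact way to see the Neville scheme at work is to run the recursion (8.6) column by column for a fixed evaluation point. The sketch below is illustrative only: the name `neville` is an assumption, and the starting column uses the constants $p_i^0 = y_i$, which interpolate at the single point $x_i$.

```python
from fractions import Fraction

def neville(xs, ys, x):
    """Value at x of the interpolation polynomial through (xs[j], ys[j])."""
    n = len(xs) - 1
    p = list(ys)                               # p[i] = p_i^0(x) = y_i
    for k in range(1, n + 1):
        # recursion (8.6); afterwards p[i] holds p_i^k(x), i = 0, ..., n - k
        p = [((x - xs[i]) * p[i + 1] - (x - xs[i + k]) * p[i]) / (xs[i + k] - xs[i])
             for i in range(n - k + 1)]
    return p[0]

# Data of Example 8.5, evaluated at x = 2 with exact rational arithmetic
value = neville([0, 1, 3, 4], [0, 2, 8, 9], Fraction(2))
print(value)
```

No polynomial coefficients are formed at any stage; the price is that the whole tableau has to be recomputed for each new evaluation point.
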


The main application of polynomial interpolation consists in the
approximation of continuous functions $f : [a, b] \to \mathbb{R}$. In this case, given $n + 1$
distinct points $x_0, \ldots, x_n \in [a, b]$, by
\[ L_n : C[a, b] \to P_n \]
we denote the interpolation operator that maps the function $f \in C[a, b]$
onto its uniquely determined interpolation polynomial $L_n f \in P_n$ with the
property
\[ (L_n f)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{8.7} \]
From the Lagrange representation (8.2) it can be seen that the operator
$L_n$ is linear and bounded (see Problem 8.4). Moreover, since $L_n p = p$ for
all $p \in P_n$, the interpolation operator is a projection; i.e., $L_n^2 = L_n$.

The interpolation polynomial $L_n f$ is used as an approximation for the
function $f$, since in general, the polynomial $L_n f$ is better suited for
computational purposes than the original function $f$. In the sequel we shall be
concerned with estimating the approximation error $f - L_n f$.


Theorem 8.10 Let $f : [a, b] \to \mathbb{R}$ be $(n + 1)$-times continuously
differentiable. Then the remainder $R_n f := f - L_n f$ for polynomial interpolation
with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form
\[ (R_n f)(x) = \frac{f^{(n+1)}(\xi)}{(n + 1)!} \prod_{j=0}^{n} (x - x_j), \quad x \in [a, b], \tag{8.8} \]
for some $\xi \in [a, b]$ depending on $x$.


Proof. Since (8.8) is trivially satisfied if $x$ coincides with one of the
interpolation points $x_0, \ldots, x_n$, we need be concerned only with the case where
$x$ does not coincide with one of the interpolation points. We define
\[ q_{n+1}(x) := \prod_{j=0}^{n} (x - x_j) \]
and, keeping $x$ fixed, consider $g : [a, b] \to \mathbb{R}$ given by
\[ g(y) := f(y) - (L_n f)(y) - q_{n+1}(y)\, \frac{f(x) - (L_n f)(x)}{q_{n+1}(x)}, \quad y \in [a, b]. \]
By the assumption on $f$, the function $g$ is also $(n + 1)$-times continuously
differentiable. Obviously, $g$ has at least $n + 2$ zeros, namely $x$ and $x_0, \ldots, x_n$.
Then, by Rolle's theorem the derivative $g'$ has at least $n + 1$ zeros. Repeating
the argument, by induction we deduce that the derivative $g^{(n+1)}$ has at least
one zero in $[a, b]$, which we denote by $\xi$. For this zero we have that
\[ 0 = f^{(n+1)}(\xi) - (n + 1)!\, \frac{(R_n f)(x)}{q_{n+1}(x)}, \]
and from this we obtain (8.8). □


The intermediate point $\xi$ in the error representation (8.8) is not, in
general, known explicitly. Therefore, the interpolation error is estimated by
the following corollary.


Corollary 8.11 Under the assumptions of Theorem 8.10 we have the error
estimate
\[ \|R_n f\|_\infty \leq \frac{1}{(n + 1)!}\, \|q_{n+1}\|_\infty\, \|f^{(n+1)}\|_\infty. \]


Example 8.12 The linear interpolation is given by
\[ (L_1 f)(x) = \frac{1}{h} \left[ f(x_0)(x_1 - x) + f(x_1)(x - x_0) \right] \]
with the step width $h = x_1 - x_0$. For the polynomial $q_2(x) = (x - x_0)(x - x_1)$
we have that
\[ \max_{x \in [x_0, x_1]} |q_2(x)| = \frac{h^2}{4}. \]
Therefore, by Corollary 8.11, the error occurring in linear interpolation of
a twice continuously differentiable function $f$ can be estimated by
\[ |(R_1 f)(x)| \leq \frac{h^2}{8} \max_{y \in [x_0, x_1]} |f''(y)|, \quad x \in [x_0, x_1]. \tag{8.9} \]
For example, the error in linear interpolation with step size $h = 0.01$ for
the sine function is less than or equal to $h^2/8 = 0.0000125$. □
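
The bound for the sine function can be checked numerically. The following sketch is an illustration under stated assumptions, not a proof: the function name `max_linear_error`, the sampling density, and the choice of the grid 0, 0.01, 0.02, ... up to 3.15 are all arbitrary choices made here.

```python
import math

def max_linear_error(f, x0, x1, samples=200):
    """Largest sampled value of |f - L_1 f| on [x0, x1]."""
    h = x1 - x0
    worst = 0.0
    for i in range(samples + 1):
        x = x0 + h * i / samples
        l1 = (f(x0) * (x1 - x) + f(x1) * (x - x0)) / h   # linear interpolant
        worst = max(worst, abs(f(x) - l1))
    return worst

h = 0.01
bound = h * h / 8          # right-hand side of (8.9), using |sin''| <= 1
observed = max(max_linear_error(math.sin, j * h, (j + 1) * h) for j in range(315))
print(observed <= bound, observed, bound)
```

The observed error comes close to the bound on the subintervals near $\pi/2$, where the second derivative of the sine function has modulus close to one.
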


By the following examples we want to introduce the question of whether 
the interpolation polynomials converge when the number n + 1 of inter- 
polation points, and hence the degree n of the interpolation polynomials, 
tends to infinity. 


Example 8.13 Let $f(x) := \sin x$ and let $x_0, \ldots, x_n \in [0, \pi]$ be $n + 1$
distinct points. Since
\[ |f^{(n+1)}(x)| \leq 1, \quad x \in [0, \pi], \]
and
\[ |q_{n+1}(x)| \leq \pi^{n+1}, \quad x \in [0, \pi], \]
by Corollary 8.11, we have the estimate
\[ |(R_n f)(x)| \leq \frac{\pi^{n+1}}{(n + 1)!}, \quad x \in [0, \pi]. \]
Hence the sequence $(L_n f)$ of interpolation polynomials converges to the
interpolated function $f$ uniformly on $[0, \pi]$ as $n \to \infty$. □


Example 8.14 A first detailed example of the insufficiency of polynomial 
interpolation even for analytic functions was investigated by Runge in 1901. 
He considered the simple function
\[ f(x) := \frac{1}{1 + 25 x^2} \]
on the interval $[-1, 1]$ with equidistant interpolation points. He discovered
that as the degree $n$ tends to infinity, the interpolation polynomials diverge
for $0.726 \leq |x| < 1$, whereas the approximation works satisfactorily in the
central portion of the interval (see Problem 8.6). Although $f$ is analytic
in all of $\mathbb{R}$, its poles in the complex plane at $\pm i/5$ are responsible for this
divergence. □
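
Runge's observation is easy to reproduce. The sketch below is an illustration only: it evaluates the Lagrange form directly in floating-point arithmetic, which is adequate for the moderate degrees used here, and the names `lagrange_eval`, `runge`, and `max_error` as well as the sampling grid are assumptions made for this example.

```python
def lagrange_eval(xs, ys, x):
    """Evaluate the interpolation polynomial in the Lagrange form (8.2)."""
    total = 0.0
    for k, (xk, yk) in enumerate(zip(xs, ys)):
        factor = 1.0
        for i, xi in enumerate(xs):
            if i != k:
                factor *= (x - xi) / (xk - xi)
        total += yk * factor
    return total

def runge(x):
    return 1.0 / (1.0 + 25.0 * x * x)

def max_error(n, samples=2001):
    """Largest sampled error on [-1, 1] for n + 1 equidistant points."""
    xs = [-1.0 + 2.0 * j / n for j in range(n + 1)]
    ys = [runge(x) for x in xs]
    grid = [-1.0 + 2.0 * i / (samples - 1) for i in range(samples)]
    return max(abs(runge(x) - lagrange_eval(xs, ys, x)) for x in grid)

errors = {n: max_error(n) for n in (10, 20)}
print(errors)
```

Doubling the degree from 10 to 20 makes the maximal error considerably larger instead of smaller; the growth is concentrated near the endpoints of the interval.
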


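Example 8.15 Consider the continuous function
\[ f(x) := \begin{cases} x \sin \dfrac{\pi}{x}, & x \in (0, 1], \\[1ex] 0, & x = 0. \end{cases} \]
With the interpolation points chosen as
\[ x_j = \frac{1}{j + 1}, \quad j = 0, \ldots, n, \]
we have that $f(x_j) = 0$, $j = 0, \ldots, n$, and therefore $L_n f = 0$ for all
$n$. Hence, in this case the sequence $(L_n f)$ converges only at the points
$x_j$, $j \in \mathbb{N} \cup \{0\}$, to the interpolated function $f$. □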


These three examples illustrate that for polynomial interpolation both 
convergence and divergence are possible. We complement the examples by 
stating the following two theorems without detailed proofs. 


Theorem 8.16 (Marcinkiewicz) For each function $f \in C[a, b]$ there exists
a sequence of interpolation points $(x_j^{(n)})$, $j = 0, \ldots, n$, $n = 0, 1, \ldots$,
such that the sequence $(L_n f)$ of interpolation polynomials $L_n f \in P_n$ with
$(L_n f)(x_j^{(n)}) = f(x_j^{(n)})$, $j = 0, \ldots, n$, converges to $f$ uniformly on $[a, b]$.


Proof. The proof relies on the Weierstrass approximation theorem and the
Chebyshev alternation theorem. The Weierstrass approximation theorem
(see [16]) ensures that for each $f \in C[a, b]$ there exists a sequence of
polynomials $p_n \in P_n$ such that $\|p_n - f\|_\infty \to 0$ as $n \to \infty$. As a consequence of
the Chebyshev alternation theorem from approximation theory (see [16]),
for the uniquely determined best approximation $p_n$ to $f$ in the maximum
norm with respect to $P_n$, the error $p_n - f$ has at least $n + 1$ zeros in $[a, b]$.
Then taking the sequence of these zeros as the sequence of interpolation
points implies the statement of the theorem. □


Theorem 8.17 (Faber) For each sequence of interpolation points $(x_j^{(n)})$
there exists a function $f \in C[a, b]$ such that the sequence $(L_n f)$ of
interpolation polynomials $L_n f \in P_n$ does not converge to $f$ uniformly on $[a, b]$.


Proof. This is a consequence of the uniform boundedness principle,
Theorem 12.7. It implies that from the convergence of the sequence $(L_n f)$ for
all $f \in C[a, b]$ it follows that there must exist a constant $C > 0$ such that
$\|L_n\|_\infty \leq C$ for all $n \in \mathbb{N}$. Then the statement of the theorem is obtained
by showing that the interpolation operator $L_n$ satisfies $\|L_n\|_\infty \geq c \ln n$ for
all $n \in \mathbb{N}$ and some $c > 0$ (see [16]). □


We conclude this section by briefly describing Hermite interpolation, 
where in addition to the values of the polynomial, the values of its first 
derivative at the interpolation points are also prescribed. 


Theorem 8.18 Given $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ and $2n + 2$
values $y_0, \ldots, y_n \in \mathbb{R}$ and $y_0', \ldots, y_n' \in \mathbb{R}$, there exists a unique polynomial
$p_{2n+1} \in P_{2n+1}$ with the property
\[ p_{2n+1}(x_j) = y_j, \quad p_{2n+1}'(x_j) = y_j', \quad j = 0, \ldots, n. \tag{8.10} \]
This Hermite interpolation polynomial is given by
\[ p_{2n+1} = \sum_{k=0}^{n} \left[ y_k H_k^0 + y_k' H_k^1 \right] \tag{8.11} \]
with the Hermite factors
\[ H_k^0(x) := \left[ 1 - 2 \ell_k'(x_k)(x - x_k) \right] [\ell_k(x)]^2, \quad H_k^1(x) := (x - x_k)\, [\ell_k(x)]^2 \]
expressed in terms of the Lagrange factors from Theorem 8.3.


Proof. Obviously, the polynomial $p_{2n+1}$ belongs to $P_{2n+1}$, since the Hermite
factors have degree $2n + 1$. From (8.3), by elementary calculations it can
be seen that (see Problem 8.7)
\[
\begin{aligned}
H_k^0(x_j) &= (H_k^1)'(x_j) = \delta_{jk}, \\
(H_k^0)'(x_j) &= H_k^1(x_j) = 0,
\end{aligned}
\qquad j, k = 0, \ldots, n. \tag{8.12}
\]
From this it follows that the polynomial (8.11) satisfies the Hermite
interpolation property (8.10).

To prove uniqueness of the Hermite interpolation polynomial we assume
that $p_{2n+1,1}, p_{2n+1,2} \in P_{2n+1}$ are two polynomials having the interpolation
property (8.10). Then the difference $p_{2n+1} := p_{2n+1,1} - p_{2n+1,2}$ satisfies
\[ p_{2n+1}(x_j) = p_{2n+1}'(x_j) = 0, \quad j = 0, \ldots, n; \]
i.e., the polynomial $p_{2n+1} \in P_{2n+1}$ has $n + 1$ zeros of order two and
therefore, by Theorem 8.1, must be identically equal to zero. This implies that
$p_{2n+1,1} = p_{2n+1,2}$. □
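
The representation (8.11) can be evaluated directly from the Lagrange factors. The sketch below is illustrative: the name `hermite_eval` is an assumption, and it uses the identity $\ell_k'(x_k) = \sum_{i \neq k} 1/(x_k - x_i)$, which follows from differentiating the product that defines the Lagrange factor. Exact rational arithmetic is used so that the check against a cubic is exact.

```python
from fractions import Fraction

def hermite_eval(xs, ys, dys, x):
    """Evaluate the Hermite interpolation polynomial (8.11) at x."""
    total = Fraction(0)
    for k, xk in enumerate(xs):
        ell = Fraction(1)     # Lagrange factor l_k(x)
        dell = Fraction(0)    # l_k'(x_k) = sum over i != k of 1 / (x_k - x_i)
        for i, xi in enumerate(xs):
            if i != k:
                ell *= Fraction(x - xi) / (xk - xi)
                dell += Fraction(1) / (xk - xi)
        h0 = (1 - 2 * dell * (x - xk)) * ell ** 2   # Hermite factor H_k^0(x)
        h1 = (x - xk) * ell ** 2                    # Hermite factor H_k^1(x)
        total += ys[k] * h0 + dys[k] * h1
    return total

# With n = 1 the interpolant lies in P_3, so the cubic x^3 is reproduced.
xs, ys, dys = [0, 2], [0, 8], [0, 12]
print(hermite_eval(xs, ys, dys, Fraction(1)), hermite_eval(xs, ys, dys, Fraction(3)))
```
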


The main application of Hermite interpolation consists in the
approximation of a given function $f \in C^1[a, b]$ by interpolating its function values
and the values of its derivative at $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$.
By
\[ H_n : C^1[a, b] \to P_{2n+1} \]
we denote the Hermite interpolation operator that maps continuously
differentiable functions $f : [a, b] \to \mathbb{R}$ into the uniquely determined Hermite
interpolation polynomial $H_n f \in P_{2n+1}$ with the property
\[ (H_n f)(x_j) = f(x_j), \quad (H_n f)'(x_j) = f'(x_j), \quad j = 0, \ldots, n. \]
The following theorem can be proven analogously to Theorem 8.10 (see
Problem 8.8).


Theorem 8.19 Let $f : [a, b] \to \mathbb{R}$ be $(2n + 2)$-times continuously
differentiable. Then the remainder $R_n f := f - H_n f$ for Hermite interpolation
with $n + 1$ distinct points $x_0, \ldots, x_n \in [a, b]$ can be represented in the form
\[ (R_n f)(x) = \frac{f^{(2n+2)}(\xi)}{(2n + 2)!} \prod_{j=0}^{n} (x - x_j)^2, \quad x \in [a, b], \tag{8.13} \]
for some $\xi \in [a, b]$ depending on $x$.


8.2 Trigonometric Interpolation


In applications, quite frequently there occur periodic functions, i.e.,
functions with the property
\[ f(t + T) = f(t), \quad t \in \mathbb{R}, \]
for some $T > 0$. For example, functions defined on closed planar or spatial
curves always may be viewed as periodic functions. Polynomial interpolation
is not appropriate for periodic functions, since algebraic polynomials
are not periodic. Therefore, we proceed by considering interpolation by
trigonometric polynomials, which was first used independently by Clairaut
(1759) and Lagrange (1762). Without loss of generality we assume that the
period is equal to $T = 2\pi$.


Definition 8.20 For $n \in \mathbb{N}$ we denote by $T_n$ the linear space of
trigonometric polynomials
\[ q(t) = \sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt \]
with real (or complex) coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$. A
trigonometric polynomial $q \in T_n$ is said to be of degree $n$ if $|a_n| + |b_n| > 0$.


From the addition theorems for the cosine and sine functions it follows
that $q_1 q_2 \in T_{n_1 + n_2}$ if $q_1 \in T_{n_1}$ and $q_2 \in T_{n_2}$. This justifies speaking of
trigonometric polynomials.


Theorem 8.21 A trigonometric polynomial in $T_n$ that has more than $2n$
distinct zeros in the periodicity interval $[0, 2\pi)$ must vanish identically; i.e.,
all its coefficients must be equal to zero.


Proof. We consider a trigonometric polynomial $q \in T_n$ of the form
\[ q(t) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kt + b_k \sin kt]. \tag{8.14} \]
Setting $b_0 = 0$,
\[ \gamma_k := \frac{1}{2} (a_k - i b_k), \quad \gamma_{-k} := \frac{1}{2} (a_k + i b_k), \quad k = 0, \ldots, n, \tag{8.15} \]
and using Euler's formula
\[ e^{it} = \cos t + i \sin t, \]
we can rewrite (8.14) in the complex form
\[ q(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}. \tag{8.16} \]
Therefore, substituting $z = e^{it}$ and setting
\[ p(z) := \sum_{k=-n}^{n} \gamma_k z^{n+k}, \]
we have the relation
\[ q(t) = z^{-n} p(z). \]
Now assume that the trigonometric polynomial $q \in T_n$ has more than $2n$
distinct zeros in the interval $[0, 2\pi)$. Then the algebraic polynomial $p \in P_{2n}$
has more than $2n$ distinct zeros lying on the unit circle in the complex plane,
since the function $t \mapsto e^{it}$ maps $[0, 2\pi)$ bijectively onto the unit circle. By
Theorem 8.1, the algebraic polynomial $p$ must be identically zero, and now
(8.15) implies that also $q$ must be identically zero. □


Theorem 8.22 The cosine functions $c_k(t) := \cos kt$, $k = 0, 1, \ldots, n$, and
the sine functions $s_k(t) := \sin kt$, $k = 1, \ldots, n$, are linearly independent in
the function space $C[0, 2\pi]$.


Proof. To prove this, assume that
\[ \sum_{k=0}^{n} a_k c_k + \sum_{k=1}^{n} b_k s_k = 0; \]
that is,
\[ \sum_{k=0}^{n} a_k \cos kt + \sum_{k=1}^{n} b_k \sin kt = 0, \quad t \in [0, 2\pi]. \]
Then the trigonometric polynomial with coefficients $a_0, \ldots, a_n$ and $b_1, \ldots, b_n$
has more than $2n$ distinct zeros in $[0, 2\pi)$, and from Theorem 8.21 it follows
that all the coefficients must be zero. Note that this linear independence
also can be deduced from Theorem 3.17. □

Theorem 8.22 implies that the cosines $c_k$, $k = 0, 1, \ldots, n$, and sines
$s_k$, $k = 1, \ldots, n$, form a basis for $T_n$ and that $T_n$ has dimension $2n + 1$.


Theorem 8.23 Given $2n + 1$ distinct points $t_0, \ldots, t_{2n} \in [0, 2\pi)$ and $2n + 1$
values $y_0, \ldots, y_{2n} \in \mathbb{R}$, there exists a uniquely determined trigonometric
polynomial $q_n \in T_n$ with the property
\[ q_n(t_j) = y_j, \quad j = 0, \ldots, 2n. \tag{8.17} \]
In the Lagrange representation, this trigonometric interpolation polynomial
is given by
\[ q_n = \sum_{k=0}^{2n} y_k \ell_k \tag{8.18} \]
with the Lagrange factors
\[ \ell_k(t) := \prod_{i=0,\, i \neq k}^{2n} \frac{\sin \dfrac{t - t_i}{2}}{\sin \dfrac{t_k - t_i}{2}}, \quad k = 0, \ldots, 2n. \]

Proof. The function $q_n$ belongs to $T_n$, since the Lagrange factors are
trigonometric polynomials of degree $n$. The latter is a consequence of
\[ \sin \frac{t - t_0}{2}\, \sin \frac{t - t_1}{2} = \frac{1}{2} \cos \frac{t_1 - t_0}{2} - \frac{1}{2} \cos \left( t - \frac{t_0 + t_1}{2} \right), \]
i.e., each of the functions $\ell_k$ is a product of $n$ trigonometric polynomials of
degree one. As in Theorem 8.3, we have $\ell_k(t_j) = \delta_{jk}$ for $j, k = 0, \ldots, 2n$,
which shows that $q_n$ indeed solves the trigonometric interpolation problem.
Uniqueness of the trigonometric interpolation polynomial follows
analogously to the proof of Theorem 8.3 with the aid of Theorem 8.21. □


We now consider the important case of an equidistant subdivision
\[ t_j = \frac{2\pi j}{2n + 1}, \quad j = 0, \ldots, 2n. \]
For this we first note the summation formula
\[ \sum_{j=0}^{2n} e^{ikt_j} = \sum_{j=0}^{2n} e^{ijt_k} = \begin{cases} 2n + 1, & k = 0, \\ 0, & k = \pm 1, \ldots, \pm 2n, \end{cases} \tag{8.19} \]
which is a consequence of the fact that for $e^{it_k} \neq 1$ we have the geometric
sum
\[ \sum_{j=0}^{2n} e^{ijt_k} = \frac{1 - e^{i(2n+1)t_k}}{1 - e^{it_k}} = 0, \]
whereas for $e^{it_k} = 1$ each term in the sum is equal to one.

We now attempt to find the uniquely determined interpolation polynomial
in the complex form
\[ q_n(t) = \sum_{k=-n}^{n} \gamma_k e^{ikt}. \]
From the interpolation conditions
\[ q_n(t_j) = y_j, \quad j = 0, \ldots, 2n, \]
we observe that solving the interpolation problem is equivalent to solving
the system of linear equations
\[ \sum_{k=-n}^{n} \gamma_k e^{ikt_j} = y_j, \quad j = 0, \ldots, 2n. \tag{8.20} \]
Assume that the coefficients $\gamma_k$ solve (8.20). Then, with the aid of (8.19),
we obtain
\[ \sum_{j=0}^{2n} y_j e^{-imt_j} = \sum_{k=-n}^{n} \gamma_k \sum_{j=0}^{2n} e^{i(k-m)t_j} = (2n + 1)\, \gamma_m; \]
i.e., any solution of (8.20) must be of the form
\[ \gamma_k = \frac{1}{2n + 1} \sum_{j=0}^{2n} y_j e^{-ikt_j}, \quad k = -n, \ldots, n. \tag{8.21} \]
On the other hand, again with the aid of (8.19), for $\gamma_k$ given by (8.21) we
have that
\[ \sum_{k=-n}^{n} \gamma_k e^{ikt_j} = \frac{1}{2n + 1} \sum_{m=0}^{2n} y_m \sum_{k=-n}^{n} e^{ik(t_j - t_m)} = y_j, \quad j = 0, \ldots, 2n; \]
i.e., the linear system (8.20) has a unique solution, which is given by (8.21).
From this, using the relation (8.15) between the real representation (8.14)
and the complex representation (8.16) of trigonometric polynomials, we
derive the following theorem.


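Theorem 8.24 There exists a unique trigonometric polynomial
\[ q_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n} [a_k \cos kt + b_k \sin kt] \]
satisfying the interpolation property
\[ q_n\!\left( \frac{2\pi j}{2n + 1} \right) = y_j, \quad j = 0, \ldots, 2n. \]
Its coefficients are given by
\[ a_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \cos \frac{2\pi jk}{2n + 1}, \quad k = 0, \ldots, n, \]
\[ b_k = \frac{2}{2n + 1} \sum_{j=0}^{2n} y_j \sin \frac{2\pi jk}{2n + 1}, \quad k = 1, \ldots, n. \]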

For an equidistant subdivision with an even number $2n$ of interpolation
points
\[ t_j = \frac{\pi j}{n}, \quad j = 0, \ldots, 2n - 1, \]
we have only $2n$ conditions to determine an element of the
$(2n + 1)$-dimensional space $T_n$. However, since the function $\sin nt$ obviously has its zeros
at the interpolation points, we drop it from the interpolation polynomial.
The proof of the following theorem is completely analogous to the proof of
Theorem 8.24.

Theorem 8.25 There exists a unique trigonometric polynomial
\[ q_n(t) = \frac{a_0}{2} + \sum_{k=1}^{n-1} [a_k \cos kt + b_k \sin kt] + \frac{a_n}{2} \cos nt \]
satisfying the interpolation property
\[ q_n\!\left( \frac{\pi j}{n} \right) = y_j, \quad j = 0, \ldots, 2n - 1. \]
Its coefficients are given by
\[ a_k = \frac{1}{n} \sum_{j=0}^{2n-1} y_j \cos \frac{\pi jk}{n}, \quad k = 0, \ldots, n, \]
\[ b_k = \frac{1}{n} \sum_{j=0}^{2n-1} y_j \sin \frac{\pi jk}{n}, \quad k = 1, \ldots, n - 1. \]

Obviously, the trigonometric interpolation polynomials of Theorems 8.24 
and 8.25 may be viewed as discretized versions of the Fourier series, where 
the integrals giving the coefficients of the Fourier series (see Problem 3.20) 
are approximated by the rectangular quadrature rule at an equidistant grid 
(see Corollary 9.27). Therefore, trigonometric interpolation on an equidis- 
tant grid is also known as the discrete Fourier transform. 

An effective numerical evaluation of trigonometric polynomials can be
done analogously to the Horner scheme for algebraic polynomials. For the
polynomial
\[ p(z) := \sum_{k=0}^{n} c_k z^k \]
the recursion (6.10) of the Horner scheme has the form
\[ b_{k-1} = b_k z + c_{k-1}, \quad k = n, \ldots, 1, \]
starting with $b_n = c_n$, and it delivers $p(z) = b_0$. Assuming that the
coefficients $c_k$ are real, we substitute $z = e^{it}$ and separate into real and imaginary
parts, $b_k = u_k + i v_k$, to obtain $u_n = c_n$, $v_n = 0$, and the recursion
\[ u_{k-1} = u_k \cos t - v_k \sin t + c_{k-1}, \quad v_{k-1} = u_k \sin t + v_k \cos t, \]
for $k = n, n - 1, \ldots, 1$. From this we find
\[ u_0 = \sum_{k=0}^{n} c_k \cos kt, \quad v_0 = \sum_{k=1}^{n} c_k \sin kt; \]
i.e., the evaluation of a trigonometric polynomial at a point $t$ can be reduced
to the evaluation of $\sin t$ and $\cos t$ and $O(n)$ additions and multiplications.
To compute all the coefficients $a_k$ and $b_k$ in Theorem 8.24 or 8.25 by this
approach requires $O(n^2)$ additions and multiplications.
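
The Horner-type recursion for the cosine and sine sums can be sketched as follows. The function name `trig_sums` and the test coefficients are illustrative assumptions; the loop is a direct transcription of the recursion for the real and imaginary parts, so only one cosine and one sine evaluation are needed per point.

```python
import math

def trig_sums(c, t):
    """Return (sum_{k=0}^n c_k cos kt, sum_{k=1}^n c_k sin kt) for real c_k."""
    cos_t, sin_t = math.cos(t), math.sin(t)
    u, v = c[-1], 0.0                        # u_n = c_n, v_n = 0
    for k in range(len(c) - 1, 0, -1):       # k = n, n - 1, ..., 1
        # real and imaginary parts of b_{k-1} = b_k e^{it} + c_{k-1}
        u, v = u * cos_t - v * sin_t + c[k - 1], u * sin_t + v * cos_t
    return u, v

c = [1.0, -2.0, 0.5, 3.0]
t = 0.7
u0, v0 = trig_sums(c, t)
direct_cos = sum(ck * math.cos(k * t) for k, ck in enumerate(c))
direct_sin = sum(ck * math.sin(k * t) for k, ck in enumerate(c))
print(abs(u0 - direct_cos), abs(v0 - direct_sin))
```
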

By the fast Fourier transform, which is attributed to Cooley and Tukey
(1965) and which was known already to Gauss, the computational costs
can be reduced even further. The main idea is to exploit the symmetries of
$e^{2\pi i j/n}$ if $n$ is a power of two, say $n = 2^p$ with $p \in \mathbb{N}$. We briefly explain the
fast Fourier transform algorithm for the evaluation of the discrete Fourier
transform in the complex form
\[ c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j e^{-\frac{2\pi i}{n} kj}, \quad k = 0, \ldots, n - 1. \tag{8.22} \]
Let $m := n/2 = 2^{p-1}$ and $w := e^{-2\pi i/n}$. Then $w^n = 1$, $w^m = -1$, and
(8.22) reads
\[ c_k = \frac{1}{n} \sum_{j=0}^{n-1} y_j w^{jk}, \quad k = 0, \ldots, n - 1. \]
Now, the basic idea of the Cooley–Tukey algorithm is to break this sum
into two parts for $j$ even and $j$ odd; i.e.,
\[ c_k = \frac{1}{2}\, \gamma_k + \frac{1}{2}\, w^k \delta_k, \quad k = 0, \ldots, n - 1, \]
where
\[ \gamma_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j} w^{2jk}, \quad \delta_k := \frac{1}{m} \sum_{j=0}^{m-1} y_{2j+1} w^{2jk}, \quad k = 0, \ldots, n - 1. \]
Since $w^2 = e^{-2\pi i/m}$, we have $\gamma_{k+m} = \gamma_k$ and $\delta_{k+m} = \delta_k$, and therefore
\[ c_k = \frac{1}{2}\, \gamma_k + \frac{1}{2}\, w^k \delta_k, \quad c_{k+m} = \frac{1}{2}\, \gamma_k - \frac{1}{2}\, w^k \delta_k, \quad k = 0, \ldots, m - 1. \]
Obviously, the $\gamma_k$, $\delta_k$, $k = 0, \ldots, m - 1$, represent a discrete Fourier
transform of length $m = n/2$. Hence, the discrete Fourier transform of length
$n$ is reduced to two discrete Fourier transforms of length $n/2$ followed by
$n$ multiplications and $n$ additions. If this is done recursively, we arrive at
the following operation count. Assuming that the $w^k$, $k = 1, \ldots, m - 1$, are
precomputed, let $M_p$ denote the number of additions and multiplications
needed for the Fourier transform of length $n = 2^p$. Then,
\[ M_p = 2 M_{p-1} + 2^{p+1} \]
with $M_0 = 0$. From this, by induction, it follows that
\[ M_p = p\, 2^{p+1} = 2 n \log_2 n, \]
i.e., that the computational cost is reduced significantly from order $O(n^2)$
to order $O(n \log_2 n)$.
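
The even/odd splitting can be written as a short recursive routine. The sketch below is meant only to illustrate the idea: the names `dft_direct` and `fft_recursive` are assumptions, the powers of $w$ are recomputed rather than precomputed, and no bit reversal or in-place storage is attempted. It uses the normalization of (8.22), including the factor $1/n$.

```python
import cmath

def dft_direct(y):
    """c_k of (8.22) by direct summation, O(n^2) operations."""
    n = len(y)
    return [sum(y[j] * cmath.exp(-2j * cmath.pi * j * k / n) for j in range(n)) / n
            for k in range(n)]

def fft_recursive(y):
    """c_k of (8.22) for len(y) a power of two, via the even/odd splitting."""
    n = len(y)
    if n == 1:
        return [complex(y[0])]
    m = n // 2
    gamma = fft_recursive(y[0::2])        # transform of y_0, y_2, ... (length m)
    delta = fft_recursive(y[1::2])        # transform of y_1, y_3, ... (length m)
    c = [0j] * n
    for k in range(m):
        wk = cmath.exp(-2j * cmath.pi * k / n)          # w^k
        c[k] = 0.5 * gamma[k] + 0.5 * wk * delta[k]
        c[k + m] = 0.5 * gamma[k] - 0.5 * wk * delta[k]
    return c

y = [1.0, 2.0, -1.0, 0.5, 3.0, 0.0, -2.0, 4.0]
fast, slow = fft_recursive(y), dft_direct(y)
print(max(abs(f - s) for f, s in zip(fast, slow)))
```

Both routines return the same coefficients up to rounding errors; only the operation count differs.
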


The actual numerical implementation is based on writing the indices $k$
and $j$ in a binary representation
\[ k = [k_0, \ldots, k_{p-1}] = \sum_{q=0}^{p-1} k_q 2^q, \quad j = [j_0, \ldots, j_{p-1}] = \sum_{q=0}^{p-1} j_q 2^q \]
with $k_q, j_q \in \{0, 1\}$ for $q = 0, \ldots, p - 1$. Then
\[ e^{-\frac{2\pi i}{n} kj} = \prod_{q=0}^{p-1} e^{-\frac{2\pi i j_q}{2^{p-q}} [k_0, \ldots, k_{p-q-1}]}, \]
since
\[ e^{-\frac{2\pi i}{2^p}\, 2^{q+r}} = 1 \]
for $q + r \geq p$. Inserting this into (8.22), we can split the long sum into $p$
nested short sums, and the Fourier transform becomes
\[ c_k = \frac{1}{n} \sum_{j_0=0}^{1} e^{-\frac{2\pi i j_0}{2^p} [k_0, \ldots, k_{p-1}]} \sum_{j_1=0}^{1} e^{-\frac{2\pi i j_1}{2^{p-1}} [k_0, \ldots, k_{p-2}]} \cdots \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i j_{p-1}}{2} k_0}\, y_{[j_0, \ldots, j_{p-1}]}. \]
Define the intermediate sums
\[ S^q_{[j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0]} := \sum_{j_{p-q}=0}^{1} e^{-\frac{2\pi i j_{p-q}}{2^q} [k_0, \ldots, k_{q-1}]} \cdots \sum_{j_{p-1}=0}^{1} e^{-\frac{2\pi i j_{p-1}}{2} k_0}\, y_{[j_0, \ldots, j_{p-1}]} \]
for $q = 1, \ldots, p$ and $j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0 \in \{0, 1\}$. Then clearly,
\[ c_{[k_0, \ldots, k_{p-1}]} = \frac{1}{n}\, S^p_{[k_{p-1}, \ldots, k_0]}, \tag{8.23} \]
and setting
\[ S^0_{[j_0, \ldots, j_{p-1}]} := y_{[j_0, \ldots, j_{p-1}]}, \]
we have the recursive relation
\[ S^q_{[j_0, \ldots, j_{p-q-1}, k_{q-1}, \ldots, k_0]} = S^{q-1}_{[j_0, \ldots, j_{p-q-1}, 0, k_{q-2}, \ldots, k_0]} + S^{q-1}_{[j_0, \ldots, j_{p-q-1}, 1, k_{q-2}, \ldots, k_0]}\; e^{-\frac{2\pi i}{2^q} [k_0, \ldots, k_{q-1}]} \]
for $q = 1, \ldots, p$. Each step of these $p$ recursions requires $n$ additions and
$n$ multiplications. Hence, the total computational cost is indeed of order
$O(n \log_2 n)$. For more details of the actual numerical implementation and,
in particular, on how to effectively perform the so-called bit reversal in
order to arrange the result (8.23) in the natural order, we refer to [45].

The error analysis for trigonometric interpolation is more complicated 
than the error analysis of the previous section for polynomial interpolation. 
Denote by $L_n : C[0, 2\pi] \to T_n$ the trigonometric interpolation operator
that maps the function $f$ onto its trigonometric interpolation polynomial
$L_n f$. For equidistant grids, by Problems 8.12 and 8.13 we have convergence
$\|L_n f - f\|_2 \to 0$, $n \to \infty$, for each continuous $2\pi$-periodic function $f$ and
$\|L_n f - f\|_\infty \to 0$, $n \to \infty$, for each continuously differentiable $2\pi$-periodic
function $f$. For a detailed error analysis we refer to [49].


8.3 Spline Interpolation 


As we have seen in our considerations of the convergence of interpolation 
polynomials, increasing the number of interpolation points, i.e., increasing 
the degree of the polynomials, does not always lead to an improvement 
in the approximation. The spline interpolation that we will study in this 
section remedies this deficiency of interpolation by high-degree polynomials 
through a piecewise polynomial interpolation of low degree. 

A frequently used method of this type is piecewise linear interpolation.
Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of the interval $[a, b]$.
Then a given function $f \in C[a, b]$ can be approximated by a continuous
piecewise linear function by linear interpolation on each of the subintervals,
i.e., according to Example 8.12, by
\[ s_n(x) = \frac{1}{x_j - x_{j-1}} \left[ f(x_{j-1})(x_j - x) + f(x_j)(x - x_{j-1}) \right], \quad x \in [x_{j-1}, x_j]. \]
From the error estimate (8.9) for linear interpolation, we see that for
piecewise linear interpolation we have uniform convergence $\|s_n - f\|_\infty \to 0$
for $n \to \infty$ on $[a, b]$, provided that $h := \max_{j=1,\ldots,n} |x_j - x_{j-1}| \to 0$ and
$f \in C^2[a, b]$. The main advantage of this method is its simplicity and its
stability with respect to errors in the interpolation values. However, since
by (8.9) linear interpolation has an error only of order $O(h^2)$, for achieving
a prescribed accuracy it usually requires a much finer discretization than
some of the higher-order methods described below.
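
The second-order behavior of piecewise linear interpolation shows up clearly in a small experiment. The sketch below is illustrative: the names `piecewise_linear` and `max_pl_error`, the test function sine on $[0, \pi]$, and the uniform subdivision are assumptions made for this example.

```python
import bisect
import math

def piecewise_linear(xs, fs, x):
    """Evaluate s_n at x for ascending knots xs and values fs."""
    # index j with x in [xs[j-1], xs[j]], clamped to the valid range
    j = min(max(bisect.bisect_right(xs, x), 1), len(xs) - 1)
    h = xs[j] - xs[j - 1]
    return (fs[j - 1] * (xs[j] - x) + fs[j] * (x - xs[j - 1])) / h

def max_pl_error(n, f=math.sin, a=0.0, b=math.pi, samples=2000):
    """Largest sampled error for n equal subintervals of [a, b]."""
    xs = [a + (b - a) * j / n for j in range(n + 1)]
    fs = [f(x) for x in xs]
    grid = (a + (b - a) * i / samples for i in range(samples + 1))
    return max(abs(f(x) - piecewise_linear(xs, fs, x)) for x in grid)

e10, e20 = max_pl_error(10), max_pl_error(20)
print(e10, e20, e10 / e20)
```

Halving the step width reduces the error by a factor of about four, and both errors stay below the bound $h^2/8$ from (8.9).
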


Definition 8.26 Let $a = x_0 < x_1 < \cdots < x_n = b$ be a subdivision of
the interval $[a, b]$ and $m \in \mathbb{N}$. A function $s : [a, b] \to \mathbb{R}$ is called a spline
of degree $m$ with respect to this subdivision if $s$ is $(m - 1)$-times continuously
differentiable on $[a, b]$ and if the restriction of $s$ to each subinterval
$[x_{j-1}, x_j]$ for $j = 1, \ldots, n$ reduces to a polynomial of degree at most $m$. By
$S_m^n$ we denote the set of all splines of degree $m$ for a fixed subdivision.


Although piecewise polynomials have been studied since the beginning of 
this century, the notation spline was introduced only in 1946 by Schoenberg. 
The term originates from the thin wooden or metal strips that were used 
by draftsmen to fit a smooth curve between specified points. Since small 
displacements $s$ of a thin elastic beam are governed by the fourth-order
differential equation $s^{(4)} = 0$, cubic splines, i.e., splines of degree three,
indeed model the draftsmen’s splines. 


Theorem 8.27 $S_m^n$ is a linear space of dimension $m + n$.


Proof. Clearly, $S_m^n$ is a linear space, since $C^{m-1}[a, b]$ and $P_m$ are linear
spaces. In the sequel we shall use the notation
\[ x_+^m := \begin{cases} x^m, & x \geq 0, \\ 0, & x < 0, \end{cases} \]
for $m \in \mathbb{N}$. The $m + n$ functions
\[
\begin{aligned}
u_k(x) &:= (x - x_0)^k, & k &= 0, \ldots, m, \\
v_k(x) &:= (x - x_k)_+^m, & k &= 1, \ldots, n - 1,
\end{aligned}
\tag{8.24}
\]
are linearly independent. In order to see this, let
\[ \sum_{k=0}^{m} \alpha_k u_k + \sum_{k=1}^{n-1} \beta_k v_k = 0. \]
Then, in particular,
\[ \sum_{k=0}^{m} \alpha_k (x - x_0)^k = 0, \quad x \in [x_0, x_1], \]
whence $\alpha_k = 0$ for $k = 0, \ldots, m$. Then we have
\[ \beta_1 (x - x_1)^m = 0, \quad x \in [x_1, x_2], \]
and therefore $\beta_1 = 0$. Repeating this argument inductively, it follows that
$\beta_k = 0$, $k = 1, \ldots, n - 1$.

To complete the proof, we need to show that each $s \in S_m^n$ can be
expressed as a linear combination of the functions (8.24). Given a spline
$s \in S_m^n$, by induction we show that there exist constants $\alpha_0, \ldots, \alpha_m$ and
$\beta_1, \ldots, \beta_{n-1}$ such that
\[ s(x) = \sum_{k=0}^{m} \alpha_k (x - x_0)^k + \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m, \quad x \in [x_0, x_j], \tag{8.25} \]
for $j = 1, \ldots, n$. This is true for $j = 1$, since on $[x_0, x_1]$ the spline $s$ coincides
with an element of $P_m$. Now assume that we have the representation (8.25)
for some $j \geq 1$. Then the difference
\[ p(x) := s(x) - \sum_{k=0}^{m} \alpha_k (x - x_0)^k - \sum_{k=1}^{j-1} \beta_k (x - x_k)_+^m \]
restricted to the interval $[x_j, x_{j+1}]$ is in $P_m$. Since the spline $s$ is in $C^{m-1}[a, b]$
and $p$ vanishes on $[x_0, x_j]$ we have that
\[ p^{(i)}(x_j) = 0, \quad i = 0, \ldots, m - 1. \]
Hence $p(x) = \beta_j (x - x_j)^m$ on $[x_j, x_{j+1}]$ for some constant $\beta_j$, and because
$(x - x_j)_+^m = 0$ on $[x_0, x_j]$, the representation (8.25) is proven for $j + 1$. □


Since the spline space $S_m^n$ has dimension $m + n$, the $n + 1$ interpolation
conditions at the points $x_0, \ldots, x_n$ are not sufficient to determine uniquely
a spline of degree greater than one. Therefore, we need to add additional
requirements in the form of conditions at the two endpoints $x_0 = a$ and
$x_n = b$ of the interval. Since we want to divide the number of these end
conditions equally between both ends, we consider only odd degrees $m$.


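Lemma 8.28 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \geq 2$, and let $f \in C^\ell[a, b]$.
Assume that the spline $s \in S_m^n$ interpolates $f$, i.e.,
\[ s(x_j) = f(x_j), \quad j = 0, \ldots, n, \tag{8.26} \]
and that it satisfies the boundary conditions
\[ s^{(j)}(a) = f^{(j)}(a), \quad s^{(j)}(b) = f^{(j)}(b), \quad j = 1, \ldots, \ell - 1. \tag{8.27} \]
Then
\[ \int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]^2\, dx = \int_a^b [f^{(\ell)}(x)]^2\, dx - \int_a^b [s^{(\ell)}(x)]^2\, dx. \tag{8.28} \]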

Proof. We have that
\[ \int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]^2\, dx = \int_a^b [f^{(\ell)}(x)]^2\, dx - \int_a^b [s^{(\ell)}(x)]^2\, dx - 2R, \]
where
\[ R := \int_a^b [f^{(\ell)}(x) - s^{(\ell)}(x)]\, s^{(\ell)}(x)\, dx. \]
Since $f \in C^\ell[a, b]$ and $s \in C^{m-1}[a, b]$ has piecewise continuous derivatives
of order $m$, by $\ell - 1$ repeated partial integrations and using the boundary
conditions (8.27) we obtain that
\[ R = (-1)^{\ell-1} \int_a^b [f'(x) - s'(x)]\, s^{(m)}(x)\, dx. \]
A further partial integration and the interpolation conditions now yield
\[ R = (-1)^{\ell-1} \sum_{j=1}^{n} \int_{x_{j-1}}^{x_j} [f'(x) - s'(x)]\, s^{(m)}(x)\, dx = (-1)^{\ell-1} \sum_{j=1}^{n} [f(x) - s(x)]\, s^{(m)}(x) \Big|_{x_{j-1}}^{x_j} = 0, \]
since $s^{(m+1)} = 0$. This completes the proof. □


Lemma 8.29 Under the assumptions of Lemma 8.28 let $f = 0$. Then
$s = 0$.


Proof. For $f = 0$, from (8.28) it follows that
\[ \int_a^b [s^{(\ell)}(x)]^2\, dx = 0. \]
This implies that $s^{(\ell)} = 0$, and therefore $s \in P_{\ell-1}$ on $[a, b]$. Now the
boundary conditions $s^{(j)}(a) = 0$, $j = 0, \ldots, \ell - 1$, yield $s = 0$. □


From the proof it can be seen that Lemmas 8.28 and 8.29 remain valid
if the boundary conditions (8.27) are replaced by

\[
s^{(\ell + j)}(a) = s^{(\ell + j)}(b) = 0, \quad j = 0, \ldots, \ell - 2,
\]

or, provided that $f$ is periodic with period $b - a$, by the periodicity condition

\[
s^{(j)}(a) = s^{(j)}(b), \quad j = 1, \ldots, \ell - 1.
\]


Consequently, the following conclusions drawn from Lemma 8.29 are also 
true for these two end conditions. However, from a practical point of view 
only the latter modification is of relevance. 


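Theorem 8.30 Let $m = 2\ell - 1$ with $\ell \in \mathbb{N}$ and $\ell \ge 2$. Then, given $n + 1$
values $y_0, \ldots, y_n$ and $m - 1$ boundary data $a_1, \ldots, a_{\ell - 1}$ and $b_1, \ldots, b_{\ell - 1}$,
there exists a unique spline $s \in S_m^n$ satisfying the interpolation conditions

\[
s(x_j) = y_j, \quad j = 0, \ldots, n, \tag{8.29}
\]

and the boundary conditions

\[
s^{(j)}(a) = a_j, \quad s^{(j)}(b) = b_j, \quad j = 1, \ldots, \ell - 1. \tag{8.30}
\]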


Proof. Representing the spline in the form (8.25), i.e.,

\[
s(x) = \sum_{k=0}^{m} \alpha_k u_k(x) + \sum_{k=1}^{n-1} \beta_k v_k(x), \tag{8.31}
\]

it follows that the interpolation conditions (8.29) and boundary conditions
(8.30) are satisfied if and only if the $m + n$ coefficients $\alpha_0, \ldots, \alpha_m$ and
$\beta_1, \ldots, \beta_{n-1}$ solve the system

\[
\begin{aligned}
\sum_{k=0}^{m} \alpha_k u_k(x_j) + \sum_{k=1}^{n-1} \beta_k v_k(x_j) &= y_j, \quad j = 0, \ldots, n, \\
\sum_{k=0}^{m} \alpha_k u_k^{(j)}(a) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(a) &= a_j, \quad j = 1, \ldots, \ell - 1, \\
\sum_{k=0}^{m} \alpha_k u_k^{(j)}(b) + \sum_{k=1}^{n-1} \beta_k v_k^{(j)}(b) &= b_j, \quad j = 1, \ldots, \ell - 1,
\end{aligned} \tag{8.32}
\]

of $m + n$ linear equations. By Lemma 8.29 the homogeneous form of the
system (8.32) has only the trivial solution. Therefore, the inhomogeneous
system (8.32) is uniquely solvable, and the proof is finished. □


In principle, for the actual computation of the interpolating spline, it 
is possible to use the linear system (8.32). However, as a consequence of 
the global nature of the basis functions (8.24), this system turns out to be 
ill-conditioned. Therefore, it is preferable to use the corresponding linear 
system derived from another set of basis functions known as basic splines, 
or simply B-splines. As opposed to the splines (8.24), the B-splines have 
local support, i.e., they differ from zero only within m + 1 neighboring 
subintervals. 

For the sake of simplicity we confine our analysis of B-splines to the case
of an equidistant subdivision of step length $h$. We set

\[
B_0(x) := \begin{cases} 1, & |x| \le 0.5, \\ 0, & |x| > 0.5, \end{cases}
\]

and define recursively

\[
B_{m+1}(x) := \int_{x - \frac{1}{2}}^{x + \frac{1}{2}} B_m(y) \, dy, \quad x \in \mathbb{R}, \quad m = 0, 1, \ldots. \tag{8.33}
\]

Then, by induction, it can be seen that the $B_m$ are $(m - 1)$-times continuously
differentiable and nonnegative, vanish outside the interval
$[-m/2 - 1/2, m/2 + 1/2]$, and reduce to a polynomial of degree $m$ in each
of the intervals $[i, i + 1]$ for $m$ odd and $[i - 1/2, i + 1/2]$ for $m$ even for $i$ an
integer; i.e., the $B_m$ are splines of order $m$ on an integer grid if $m$ is odd
and on a half integer grid if $m$ is even.
Elementary integrations show that

\[
B_1(x) = \begin{cases} 1 - |x|, & |x| \le 1, \\ 0, & |x| \ge 1, \end{cases} \tag{8.34}
\]

\[
B_2(x) = \frac{1}{2} \begin{cases} 2 - (|x| - 0.5)^2 - (|x| + 0.5)^2, & |x| \le 0.5, \\ (|x| - 1.5)^2, & 0.5 \le |x| \le 1.5, \\ 0, & |x| \ge 1.5, \end{cases} \tag{8.35}
\]

\[
B_3(x) = \frac{1}{6} \begin{cases} (2 - |x|)^3 - 4(1 - |x|)^3, & |x| \le 1, \\ (2 - |x|)^3, & 1 \le |x| \le 2, \\ 0, & |x| \ge 2. \end{cases} \tag{8.36}
\]


Graphs of these B-splines are given in Figure 8.1. 


FIGURE 8.1. B-splines $B_1$, $B_2$, and $B_3$
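
The closed forms (8.34)–(8.36) translate directly into code. The following is a minimal sketch, not part of the original text: the function names `b1`, `b2`, `b3`, `integrate_midpoint`, and `next_bspline` are chosen here for illustration, and the composite midpoint rule is only an assumed stand-in for the exact integral in the recursion (8.33).

```python
def b1(x):
    """Linear B-spline, closed form (8.34)."""
    return max(0.0, 1.0 - abs(x))


def b2(x):
    """Quadratic B-spline, closed form (8.35)."""
    ax = abs(x)
    if ax <= 0.5:
        return 0.5 * (2.0 - (ax - 0.5) ** 2 - (ax + 0.5) ** 2)
    if ax <= 1.5:
        return 0.5 * (ax - 1.5) ** 2
    return 0.0


def b3(x):
    """Cubic B-spline, closed form (8.36)."""
    ax = abs(x)
    if ax <= 1.0:
        return ((2.0 - ax) ** 3 - 4.0 * (1.0 - ax) ** 3) / 6.0
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0


def integrate_midpoint(g, lo, hi, steps=2000):
    """Composite midpoint rule; used only to check the recursion numerically."""
    width = (hi - lo) / steps
    return width * sum(g(lo + (i + 0.5) * width) for i in range(steps))


def next_bspline(b, x):
    """Right-hand side of (8.33): integral of b over [x - 1/2, x + 1/2]."""
    return integrate_midpoint(b, x - 0.5, x + 0.5)
```

Comparing `next_bspline(b2, x)` with `b3(x)` at a few points gives a quick numerical check of the recursion; summing `b3(x - k)` over the integers `k` near `x` should give one, since the shifted B-splines form a partition of unity.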


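Theorem 8.31 For $m \in \mathbb{N} \cup \{0\}$ the B-splines

\[
B_m(\cdot - k), \quad k = 0, \ldots, m, \tag{8.37}
\]

are linearly independent on the interval $I_m := \left[\frac{m-1}{2}, \frac{m+1}{2}\right]$.

Proof. This is trivial for $m = 0$, and we assume that it has been proven for
degree $m - 1$ for some $m \ge 1$. Let

\[
\sum_{k=0}^{m} \alpha_k B_m(x - k) = 0, \quad x \in I_m. \tag{8.38}
\]

Then, with the aid of (8.33), differentiating (8.38) yields

\[
\sum_{k=0}^{m} \alpha_k \left[ B_{m-1}\left(x - k + \tfrac{1}{2}\right) - B_{m-1}\left(x - k - \tfrac{1}{2}\right) \right] = 0, \quad x \in I_m.
\]

Observing that the supports of $B_{m-1}(\cdot + \frac{1}{2})$ and $B_{m-1}(\cdot - m - \frac{1}{2})$ do not
intersect with $I_m$, we can rewrite this as

\[
\sum_{k=1}^{m} [\alpha_k - \alpha_{k-1}] B_{m-1}\left(x - k + \tfrac{1}{2}\right) = 0, \quad x \in I_m,
\]

whence $\alpha_k = \alpha_{k-1}$ for $k = 1, \ldots, m$ follows by the induction assumption;
i.e., $\alpha_k = \alpha$ for $k = 0, \ldots, m$. Now (8.38) reads

\[
\alpha \sum_{k=0}^{m} B_m(x - k) = 0, \quad x \in I_m,
\]

and integrating this equation over the interval $I_m$ leads to

\[
\alpha \sum_{k=0}^{m} \int_{I_m} B_m(x - k) \, dx = \alpha \int_{-(m+1)/2}^{(m+1)/2} B_m(x) \, dx = 0.
\]

This finally implies $\alpha = 0$, since the $B_m$ are nonnegative, and the proof is
finished. □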


Corollary 8.32 Let $x_k = a + hk$, $k = 0, \ldots, n$, be an equidistant subdivision
of the interval $[a, b]$ of step size $h = (b - a)/n$ with $n \ge 2$, and let
$m = 2\ell - 1$ with $\ell \in \mathbb{N}$. Then the B-splines

\[
B_{m,k}(x) := B_m\left(\frac{x - x_k}{h}\right), \quad x \in [a, b], \tag{8.39}
\]

for $k = -\ell + 1, \ldots, n + \ell - 1$ form a basis for $S_m^n$.


Proof. The $n + m$ splines (8.39) belong to $S_m^n$, and by the preceding
Theorem 8.31 they can be shown to be linearly independent on $[a, b]$. Hence,
the statement follows from Theorem 8.27. □


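The use of the B-splines as a basis opens up another possibility for the
computation of an interpolating spline. We only consider the case $m = 3$,
i.e., cubic splines. From (8.36) we note that

\[
B_3(0) = \frac{2}{3}, \quad B_3(\pm 1) = \frac{1}{6}, \quad B_3'(0) = 0, \quad B_3'(\pm 1) = \mp \frac{1}{2}.
\]

Therefore, the cubic spline

\[
s(x) = \sum_{k=-1}^{n+1} \alpha_k B_3\left(\frac{x - x_k}{h}\right), \quad x \in [a, b], \tag{8.40}
\]

satisfies the interpolation conditions (8.29) and the boundary conditions
(8.30) if and only if the $n + 3$ coefficients $\alpha_{-1}, \ldots, \alpha_{n+1}$ satisfy the system

\[
\begin{aligned}
-\frac{1}{2} \alpha_{-1} + \frac{1}{2} \alpha_1 &= h a_1, \\
\frac{1}{6} \alpha_{j-1} + \frac{2}{3} \alpha_j + \frac{1}{6} \alpha_{j+1} &= y_j, \quad j = 0, \ldots, n, \\
-\frac{1}{2} \alpha_{n-1} + \frac{1}{2} \alpha_{n+1} &= h b_1,
\end{aligned} \tag{8.41}
\]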


of n+ 3 linear equations. Since the matrix of this system is irreducible and 
weakly row-diagonally dominant, the solution can be obtained by Jacobi 
iteration (see Theorem 4.7). 
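
As an illustration of this procedure, the sketch below runs Jacobi iteration on the system (8.41) and evaluates the resulting spline through (8.40). It is only a sketch under stated assumptions: the names `spline_coefficients` and `spline_value`, the fixed number of sweeps, and the test function $\sin$ on $[0, \pi]$ are choices made here, not part of the text. The list entry `alpha[i]` stores the coefficient $\alpha_{i-1}$, and the boundary equations are taken in the form $-\frac{1}{2}\alpha_{-1} + \frac{1}{2}\alpha_1 = h a_1$ and $-\frac{1}{2}\alpha_{n-1} + \frac{1}{2}\alpha_{n+1} = h b_1$, which follows from $B_3'(\pm 1) = \mp\frac{1}{2}$.

```python
import math


def b3(x):
    """Cubic B-spline (8.36)."""
    ax = abs(x)
    if ax <= 1.0:
        return ((2.0 - ax) ** 3 - 4.0 * (1.0 - ax) ** 3) / 6.0
    if ax <= 2.0:
        return (2.0 - ax) ** 3 / 6.0
    return 0.0


def spline_coefficients(y, h, a1, b1, sweeps=400):
    """Jacobi iteration for (8.41); alpha[i] holds the coefficient alpha_{i-1}."""
    n = len(y) - 1
    alpha = [0.0] * (n + 3)
    for _ in range(sweeps):
        new = [0.0] * (n + 3)
        new[0] = alpha[2] - 2.0 * h * a1           # first equation solved for alpha_{-1}
        for j in range(n + 1):                      # interpolation equations solved for alpha_j
            new[j + 1] = (6.0 * y[j] - alpha[j] - alpha[j + 2]) / 4.0
        new[n + 2] = alpha[n] + 2.0 * h * b1        # last equation solved for alpha_{n+1}
        alpha = new
    return alpha


def spline_value(alpha, a, h, x):
    """Evaluate the cubic spline (8.40) at x."""
    n = len(alpha) - 3
    return sum(alpha[k + 1] * b3((x - (a + k * h)) / h) for k in range(-1, n + 2))


# Example: interpolate sin on [0, pi] with n = 8 subintervals,
# prescribing the first derivative cos at both endpoints.
n = 8
a, b = 0.0, math.pi
h = (b - a) / n
y = [math.sin(a + j * h) for j in range(n + 1)]
alpha = spline_coefficients(y, h, math.cos(a), math.cos(b))
max_error = max(
    abs(spline_value(alpha, a, h, a + i * (b - a) / 200) - math.sin(a + i * (b - a) / 200))
    for i in range(201)
)
```

In practice one would more likely eliminate $\alpha_{-1}$ and $\alpha_{n+1}$ and solve the remaining tridiagonal system directly; Jacobi iteration is used here only because it is the method named above.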

We conclude this section with an analysis of the interpolation error for 
cubic splines and note that the results can be extended to arbitrary odd 
degree. We begin with a convergence result for arbitrary subdivisions under 
a weak regularity assumption on the interpolated function. 


Theorem 8.33 Let $f : [a, b] \to \mathbb{R}$ be twice continuously differentiable and
let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the interpolation
and boundary conditions of Lemma 8.28. Then

\[
\|f - s\|_\infty \le \frac{h^{3/2}}{2} \|f''\|_2 \quad \text{and} \quad \|f' - s'\|_\infty \le h^{1/2} \|f''\|_2,
\]

where $h := \max_{j=1,\ldots,n} |x_j - x_{j-1}|$.


Proof. The error function $r := f - s$ has $n + 1$ zeros $x_0, \ldots, x_n$. Hence, the
distance between two consecutive zeros of $r$ is less than or equal to $h$. By
Rolle's theorem, the derivative $r'$ has $n$ zeros with distance less than or
equal to $2h$. Choose $z \in [a, b]$ such that $|r'(z)| = \|r'\|_\infty$. Then the closest
zero $\zeta$ of $r'$ has distance $|\zeta - z| \le h$, and by the Cauchy–Schwarz inequality
we can estimate

\[
\|r'\|_\infty^2 = \left| \int_\zeta^z r''(y) \, dy \right|^2 \le h \left| \int_\zeta^z [r''(y)]^2 \, dy \right| \le h \int_a^b [r''(y)]^2 \, dy.
\]

From this, using Lemma 8.28 we obtain $\|r'\|_\infty \le \sqrt{h} \, \|f''\|_2$.
Choose $z \in [a, b]$ such that $|r(z)| = \|r\|_\infty$. Then the closest zero $\xi$ of $r$
has distance $|\xi - z| \le h/2$, and we can estimate

\[
\|r\|_\infty = \left| \int_\xi^z r'(y) \, dy \right| \le \frac{h}{2} \|r'\|_\infty \le \frac{h \sqrt{h}}{2} \|f''\|_2,
\]

which concludes the proof. □


If we assume more regularity on f, we can improve on the order of 
convergence. For this we need to derive an estimate on the second derivative 
of the interpolating spline. From (8.36) it follows that 


\[
B_3''(0) = -2 \quad \text{and} \quad B_3''(\pm 1) = 1.
\]

Hence, the cubic spline (8.40) has second derivatives given by the difference
formula

\[
s''(x_j) = \frac{1}{h^2} [\alpha_{j-1} - 2\alpha_j + \alpha_{j+1}], \quad j = 0, \ldots, n. \tag{8.42}
\]

From this we deduce that

\[
\begin{aligned}
h^2 [s''(x_{j-1}) + 4 s''(x_j) + s''(x_{j+1})] = {} & [\alpha_{j-2} + 4\alpha_{j-1} + \alpha_j] \\
& - 2 [\alpha_{j-1} + 4\alpha_j + \alpha_{j+1}] \\
& + [\alpha_j + 4\alpha_{j+1} + \alpha_{j+2}]
\end{aligned}
\]

for $j = 1, \ldots, n - 1$,

\[
h^2 [4 s''(x_0) + 2 s''(x_1)] = 6 [\alpha_{-1} - \alpha_1] - 2 [\alpha_{-1} + 4\alpha_0 + \alpha_1] + 2 [\alpha_0 + 4\alpha_1 + \alpha_2],
\]

and

\[
h^2 [2 s''(x_{n-1}) + 4 s''(x_n)] = 2 [\alpha_{n-2} + 4\alpha_{n-1} + \alpha_n] - 2 [\alpha_{n-1} + 4\alpha_n + \alpha_{n+1}] - 6 [\alpha_{n-1} - \alpha_{n+1}].
\]

From this and the linear system (8.41), for the special case of the interpolation
conditions (8.26) and the boundary conditions (8.27), it follows that
the $n + 1$ values of $s''$ at the grid points satisfy the system

\[
\begin{aligned}
4 s''(x_0) + 2 s''(x_1) &= F_0, \\
s''(x_{j-1}) + 4 s''(x_j) + s''(x_{j+1}) &= F_j, \quad j = 1, \ldots, n - 1, \\
2 s''(x_{n-1}) + 4 s''(x_n) &= F_n,
\end{aligned} \tag{8.43}
\]

of $n + 1$ linear equations with right-hand sides

\[
\begin{aligned}
F_0 &:= \frac{12}{h^2} [-f(x_0) + f(x_1) - h f'(x_0)], \\
F_j &:= \frac{6}{h^2} [f(x_{j-1}) - 2 f(x_j) + f(x_{j+1})], \quad j = 1, \ldots, n - 1, \\
F_n &:= \frac{12}{h^2} [f(x_{n-1}) - f(x_n) + h f'(x_n)].
\end{aligned}
\]

From the system (8.43) we can conclude that

\[
4 |s''(x_j)| \le |F_j| + 2 \max_{k=0,\ldots,n} |s''(x_k)|, \quad j = 0, \ldots, n,
\]

and therefore

\[
\max_{j=0,\ldots,n} |s''(x_j)| \le \frac{1}{2} \max_{j=0,\ldots,n} |F_j|. \tag{8.44}
\]

If $f$ is twice continuously differentiable, by Taylor's formula we can estimate

\[
\max\{|F_0|, |F_n|\} \le 6 \|f''\|_\infty.
\]

From Example 8.12, applied to the remainder in the linear interpolation of
$f(x_j)$ from $f(x_{j-1})$ and $f(x_{j+1})$, we obtain

\[
|F_j| = \frac{6}{h^2} |f(x_{j+1}) - 2 f(x_j) + f(x_{j-1})| \le 6 \|f''\|_\infty, \quad j = 1, \ldots, n - 1.
\]

Hence, since $s''$ is piecewise linear, from (8.44) it follows that

\[
\|s''\|_\infty \le 3 \|f''\|_\infty. \tag{8.45}
\]


Theorem 8.34 Let $f : [a, b] \to \mathbb{R}$ be four-times continuously differentiable
and let $s \in S_3^n$ be the uniquely determined cubic spline satisfying the
interpolation and boundary conditions of Lemma 8.28 for an equidistant
subdivision with step width $h$. Then

\[
\|f - s\|_\infty \le \frac{h^4}{16} \|f^{(4)}\|_\infty.
\]


Proof. By $L_1 : C[a, b] \to S_1^n$ we denote the interpolation operator mapping
$g \in C[a, b]$ onto its uniquely determined piecewise linear interpolation.
From Example 8.12 we obtain that

\[
\|r\|_\infty = \|r - L_1 r\|_\infty \le \frac{h^2}{8} \|r''\|_\infty,
\]

since trivially $L_1 r = 0$.
By integration, we choose a function $w$ such that $w'' = L_1 f''$. Applying
the estimate (8.45) for the cubic spline $s - w$ and using the estimate (8.9)
for the piecewise linear interpolation of $f''$, we obtain

\[
\|f'' - s''\|_\infty \le \|f'' - L_1 f''\|_\infty + \|L_1 f'' - s''\|_\infty \le 4 \|f'' - L_1 f''\|_\infty \le \frac{h^2}{2} \|f^{(4)}\|_\infty.
\]

By piecing together the last two inequalities we obtain the assertion of the
theorem. □


8.4 Bézier Polynomials


In this section we want to introduce some of the basic ideas of computer- 
aided geometric design. We will confine our presentation to planar (and 
spatial) curves, i.e., to subsets $\Gamma \subset \mathbb{R}^m$, $m = 2, 3$, that can be described
by a continuous mapping $f : D \to \mathbb{R}^m$ of an interval $D \subset \mathbb{R}$ into $\mathbb{R}^m$.
For the purposes of computer-aided geometric design it is essential that 
the geometric objects can be visualized and manipulated on the computer 
very effectively and rapidly. This, in particular, makes it essential that 
the parameters entering the representation of the curves have a geometric 
meaning. The latter property, for example, is not fulfilled by polynomial 
curves represented through the classical monomial basis. 


Definition 8.35 For $n \in \mathbb{N} \cup \{0\}$, we denote by $P_n^m$ the linear space of
polynomials of the form

\[
p(x) = \sum_{k=0}^{n} a_k x^k, \quad x \in \mathbb{R},
\]

where $a_0, \ldots, a_n \in \mathbb{R}^m$. A polynomial $p \in P_n^m$ is said to be of degree $n$ if
$a_n \ne 0$.


We proceed by introducing a basis for polynomials on an interval $[a, b]$
in $\mathbb{R}$ with $a < b$ that is better suited for the purposes of computer-aided
design than the monomial basis. For this we make use of the fact that by
the affine linear transformation

\[
x \mapsto t(x) := \frac{x - a}{b - a} \tag{8.46}
\]

the interval $[a, b]$ can be mapped onto the interval $[0, 1]$. By the binomial
formula we have that

\[
1 = [t + (1 - t)]^n = \sum_{k=0}^{n} \binom{n}{k} t^k (1 - t)^{n-k}.
\]

The terms in this partition of unity are called Bernstein polynomials for
the interval $[0, 1]$. From these, the Bernstein polynomials for the interval
$[a, b]$ are obtained via the transformation (8.46).

Definition 8.36 The Bernstein polynomials of degree $n$ for the interval
$[0, 1]$ are given by

\[
B_k^n(t) := \binom{n}{k} t^k (1 - t)^{n-k}, \quad k = 0, \ldots, n. \tag{8.47}
\]

Correspondingly, the polynomials

\[
B_k^n(x; a, b) := B_k^n\left(\frac{x - a}{b - a}\right) = \frac{1}{(b - a)^n} \binom{n}{k} (x - a)^k (b - x)^{n-k}, \quad k = 0, \ldots, n,
\]

are called Bernstein polynomials of degree $n$ for the interval $[a, b]$.


Some basic properties of Bernstein polynomials are described in the fol- 
lowing theorem. 


Theorem 8.37 The Bernstein polynomials are nonnegative on $[0, 1]$ and
provide a partition of unity; i.e.,

\[
B_k^n(t) \ge 0, \quad t \in [0, 1], \tag{8.48}
\]

and

\[
\sum_{k=0}^{n} B_k^n(t) = 1, \quad t \in \mathbb{R}. \tag{8.49}
\]

They satisfy the relations

\[
B_k^n(t) = B_{n-k}^n(1 - t), \quad k = 0, \ldots, n, \tag{8.50}
\]

and

\[
B_0^n(t) = (1 - t) B_0^{n-1}(t), \quad B_n^n(t) = t B_{n-1}^{n-1}(t) \tag{8.51}
\]

for all $t \in \mathbb{R}$ and $n \in \mathbb{N}$. The point $t = 0$ is a zero of $B_k^n$ of order $k$, and
$t = 1$ is a zero of order $n - k$. Each of the polynomials $B_k^n$ assumes its
maximum value only at $t = k/n$. They satisfy the recursion relation

\[
B_k^n(t) = t B_{k-1}^{n-1}(t) + (1 - t) B_k^{n-1}(t), \quad t \in \mathbb{R}, \tag{8.52}
\]

for $n \in \mathbb{N}$ and $k = 1, \ldots, n - 1$. The polynomials $B_0^n, \ldots, B_n^n$ form a basis
of $P_n$.


Proof. The first five properties are obvious. The statement on the maximum
of $B_k^n$ is a consequence of

\[
\frac{d}{dt} B_k^n(t) = \binom{n}{k} t^{k-1} (1 - t)^{n-k-1} (k - nt), \quad k = 0, \ldots, n.
\]

The recursion formula (8.52) follows from the definition (8.47) and the
recursion formula

\[
\binom{n}{k} = \binom{n-1}{k-1} + \binom{n-1}{k}
\]

for the binomial coefficients. In order to show that the $n + 1$ polynomials
$B_0^n, \ldots, B_n^n$ of degree $n$ provide a basis of $P_n$, we prove that they are linearly
independent. Let

\[
\sum_{k=0}^{n} b_k B_k^n(t) = 0, \quad t \in [0, 1].
\]

Then

\[
\sum_{k=0}^{n} b_k \frac{d^j}{dt^j} B_k^n(t) = 0, \quad t \in [0, 1],
\]

and therefore

\[
\sum_{k=0}^{j} b_k \frac{d^j}{dt^j} B_k^n(0) = 0, \quad j = 0, \ldots, n,
\]

since $t = 0$ is a zero of $B_k^n$ of order $k$. From this, by induction we find that
$b_0 = \cdots = b_n = 0$. □
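
The properties collected in Theorem 8.37 are easy to check numerically. The following minimal sketch (the function name `bernstein` is chosen here for illustration) evaluates (8.47) directly; the partition of unity (8.49), the symmetry (8.50), and the recursion (8.52) can then be tested at sample points.

```python
from math import comb


def bernstein(n, k, t):
    """Bernstein polynomial B_k^n(t) on [0, 1], definition (8.47)."""
    return comb(n, k) * t ** k * (1.0 - t) ** (n - k)


n, t = 5, 0.3

# Partition of unity (8.49); it holds for all real t, not only on [0, 1].
unity = sum(bernstein(n, k, t) for k in range(n + 1))

# Symmetry (8.50): B_k^n(t) = B_{n-k}^n(1 - t).
symmetry_gap = max(abs(bernstein(n, k, t) - bernstein(n, n - k, 1.0 - t))
                   for k in range(n + 1))

# Recursion (8.52) for k = 1, ..., n - 1.
recursion_gap = max(
    abs(bernstein(n, k, t)
        - (t * bernstein(n - 1, k - 1, t) + (1.0 - t) * bernstein(n - 1, k, t)))
    for k in range(1, n)
)
```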


Definition 8.38 The coefficients $b_0, \ldots, b_n \in \mathbb{R}^m$ in the representation of
a polynomial $p \in P_n^m$ through the Bernstein basis

\[
p(x) = \sum_{k=0}^{n} b_k B_k^n(x; a, b), \quad x \in [a, b], \tag{8.53}
\]

are called control points, or Bézier points, of $p$. The polygon determined by
them is called the Bézier polygon.


We now want to indicate that the graph of the polynomial $p$ is closely
related to the form of the Bézier polygon, and for this reason the graph
of $p$ is often referred to as Bézier curve. We first note that $p(a) = b_0$
and $p(b) = b_n$; i.e., both endpoints of the Bézier curve and the Bézier
polygon coincide. Furthermore, from (8.49) it follows that the Bézier curve
is contained in the convex hull $\operatorname{con}\{b_0, \ldots, b_n\}$ of the Bézier points. The
convex hull

\[
\operatorname{con}\{b_0, \ldots, b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0, \ \sum_{k=0}^{n} \alpha_k = 1 \right\}
\]

is the smallest convex set containing the points $b_0, \ldots, b_n$ (see Problem
8.19).
For computing the derivatives of a Bézier curve we first note that

\[
\frac{d}{dt} B_k^n(t) = \binom{n}{k} \left[ k t^{k-1} (1 - t)^{n-k} - (n - k) t^k (1 - t)^{n-k-1} \right]
\]

implies that

\[
(B_k^n)' = \begin{cases} -n B_0^{n-1}, & k = 0, \\ n (B_{k-1}^{n-1} - B_k^{n-1}), & k = 1, \ldots, n - 1, \\ n B_{n-1}^{n-1}, & k = n. \end{cases} \tag{8.54}
\]

With this identity we are ready to establish the following theorem. 
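Theorem 8.39 Let

\[
p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0, 1],
\]

be a Bézier polynomial on $[0, 1]$. Then

\[
p^{(j)}(t) = \frac{n!}{(n - j)!} \sum_{k=0}^{n-j} \Delta^j b_k B_k^{n-j}(t), \quad j = 1, \ldots, n,
\]

with the forward differences $\Delta^j b_k$ recursively defined by

\[
\Delta^0 b_k := b_k, \quad \Delta^j b_k := \Delta^{j-1} b_{k+1} - \Delta^{j-1} b_k, \quad j = 1, \ldots, n.
\]


Proof. Obviously, the statement is true for $j = 0$. We assume that it has
been proven for some $0 \le j < n$. Then with the aid of (8.54) we obtain

\[
\begin{aligned}
p^{(j+1)}(t) &= \frac{n!}{(n - j)!} \sum_{k=0}^{n-j} \Delta^j b_k \frac{d}{dt} B_k^{n-j}(t) \\
&= \frac{n!}{(n - j - 1)!} \left\{ \sum_{k=1}^{n-j} \Delta^j b_k B_{k-1}^{n-j-1}(t) - \sum_{k=0}^{n-j-1} \Delta^j b_k B_k^{n-j-1}(t) \right\} \\
&= \frac{n!}{(n - j - 1)!} \sum_{k=0}^{n-j-1} \left\{ \Delta^j b_{k+1} - \Delta^j b_k \right\} B_k^{n-j-1}(t) \\
&= \frac{n!}{(n - (j + 1))!} \sum_{k=0}^{n-j-1} \Delta^{j+1} b_k B_k^{n-j-1}(t),
\end{aligned}
\]

which establishes the assertion for $j + 1$. □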


Corollary 8.40 The polynomial from Theorem 8.39 has the derivatives

\[
p^{(j)}(0) = \frac{n!}{(n - j)!} \Delta^j b_0, \quad p^{(j)}(1) = \frac{n!}{(n - j)!} \Delta^j b_{n-j}
\]

at the two endpoints.

From Corollary 8.40 we note that $p^{(j)}(0)$ depends only on $b_0, \ldots, b_j$ and
that $p^{(j)}(1)$ depends only on $b_{n-j}, \ldots, b_n$. In particular, we have that

\[
p'(0) = n (b_1 - b_0), \quad p'(1) = n (b_n - b_{n-1}); \tag{8.55}
\]

i.e., at the two endpoints the Bézier curve has the same tangent lines as
the Bézier polygon. Through the affine transformation (8.46) these results
on the derivatives carry over to the general interval $[a, b]$.
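
The derivative formula of Theorem 8.39 can be sketched in a few lines. The code below is an illustration under stated assumptions, not the book's implementation: control points are represented as tuples, and the names `forward_differences` and `bezier_derivative` as well as the sample control points are hypothetical.

```python
from math import comb, factorial


def bernstein(n, k, t):
    """Bernstein polynomial B_k^n(t), definition (8.47)."""
    return comb(n, k) * t ** k * (1.0 - t) ** (n - k)


def forward_differences(points, j):
    """Apply the forward difference operator j times to a list of points."""
    diffs = [tuple(p) for p in points]
    for _ in range(j):
        diffs = [tuple(q - p for p, q in zip(diffs[k], diffs[k + 1]))
                 for k in range(len(diffs) - 1)]
    return diffs


def bezier_derivative(points, j, t):
    """j-th derivative at t of the Bezier polynomial on [0, 1] (Theorem 8.39)."""
    n = len(points) - 1
    dim = len(points[0])
    if j > n:
        return tuple(0.0 for _ in range(dim))
    diffs = forward_differences(points, j)
    scale = factorial(n) // factorial(n - j)
    return tuple(
        scale * sum(diffs[k][c] * bernstein(n - j, k, t) for k in range(n - j + 1))
        for c in range(dim)
    )


# Hypothetical planar cubic: the end tangents follow (8.55).
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 0.0)]
tangent_start = bezier_derivative(control, 1, 0.0)   # n * (b_1 - b_0)
tangent_end = bezier_derivative(control, 1, 1.0)     # n * (b_n - b_{n-1})
```

With `j = 0` the same function evaluates the curve itself, which makes it easy to compare the formula against a finite-difference quotient.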


FIGURE 8.2. Bézier polynomials of degree two


Figure 8.2 illustrates by two Bézier polynomials of degree two in $\mathbb{R}^2$ how
the shape of the curve is influenced by the location of the control points $b_j$.
From (8.55) we also observe how to patch two Bézier polynomials of degree 
two together smoothly such that the tangent lines at the joints coincide, 
i.e., such that the two polynomials match up to a Bézier spline of degree 
two. The Bézier polynomials have the same tangent lines at the joints if 
the Bézier polygons do. This is illustrated by Figure 8.3. 


FIGURE 8.3. Bézier spline of degree two 


We will conclude this section by describing the de Casteljau algorithm
as a very stable and fast method for computing the function values $p(t)$ of
a Bézier polynomial. Given a Bézier polynomial

\[
p(t) = \sum_{k=0}^{n} b_k B_k^n(t), \quad t \in [0, 1],
\]

we define the subpolynomials $b_i^k \in P_k^m$ by

\[
b_i^k(t) := \sum_{j=0}^{k} b_{i+j} B_j^k(t) \tag{8.56}
\]

for $i = 0, \ldots, n - k$ and $k = 0, \ldots, n$. For polynomials on $[a, b]$ we have
an analogous definition for the subpolynomials. The subpolynomial $b_i^k$ is
a polynomial of degree $k$ and has the $k + 1$ control points $b_i, \ldots, b_{i+k}$.
In particular, we have that $b_0^n = p$. Analogous to the Neville scheme of
Theorem 8.9 we have the following recursion formula, which is the basis of
the de Casteljau algorithm.


Theorem 8.41 The subpolynomials $b_i^k$ of a Bézier polynomial $p$ of degree
$n$ satisfy the recursion formulae

\[
b_i^k(t) = (1 - t) b_i^{k-1}(t) + t b_{i+1}^{k-1}(t) \tag{8.57}
\]

for $i = 0, \ldots, n - k$ and $k = 1, \ldots, n$.


Proof. We insert the recursion formulae (8.51) and (8.52) for the Bernstein
polynomials into the definition (8.56) for the subpolynomials and obtain

\[
\begin{aligned}
b_i^k(t) &= b_i B_0^k(t) + \sum_{j=1}^{k-1} b_{i+j} B_j^k(t) + b_{i+k} B_k^k(t) \\
&= \sum_{j=0}^{k-1} b_{i+j} (1 - t) B_j^{k-1}(t) + \sum_{j=1}^{k} b_{i+j} \, t B_{j-1}^{k-1}(t) \\
&= (1 - t) b_i^{k-1}(t) + t b_{i+1}^{k-1}(t),
\end{aligned}
\]

which establishes (8.57). □


Since $b_0^n(t) = p(t)$, starting the recursion with $b_i^0(t) = b_i$, from (8.57) we
can compute $p(t)$ by successive convex combinations of the Bézier points
$b_0, \ldots, b_n$, which clearly is a numerically stable procedure. Since (8.57) is
similar in structure to the divided differences in Definition 8.4, the
computations can be arranged in a tableau analogous to the one for the divided
differences.
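
The recursion (8.57) can be sketched as follows. This is a minimal illustration, assuming control points given as tuples; the names `de_casteljau` and `bezier_direct` and the sample points are chosen here and do not appear in the text. The second function evaluates the Bernstein sum directly and serves only as a cross-check.

```python
from math import comb


def de_casteljau(points, t):
    """Evaluate a Bezier polynomial on [0, 1] by the recursion (8.57)."""
    level = [tuple(p) for p in points]              # b_i^0 = b_i
    while len(level) > 1:
        # b_i^k = (1 - t) * b_i^{k-1} + t * b_{i+1}^{k-1}
        level = [tuple((1.0 - t) * p + t * q for p, q in zip(level[i], level[i + 1]))
                 for i in range(len(level) - 1)]
    return level[0]                                 # b_0^n(t) = p(t)


def bezier_direct(points, t):
    """Evaluate the same polynomial from the Bernstein representation."""
    n = len(points) - 1
    dim = len(points[0])
    return tuple(
        sum(comb(n, k) * t ** k * (1.0 - t) ** (n - k) * points[k][c]
            for k in range(n + 1))
        for c in range(dim)
    )


# Hypothetical control points of a planar cubic.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 0.0)]
value = de_casteljau(control, 0.25)
```

Every intermediate point is a convex combination of two earlier ones when $t \in [0, 1]$, which is the source of the stability noted above.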

From the coefficients of the de Casteljau tableau we can construct two
Bézier polynomials on the subintervals $[0, t]$ and $[t, 1]$ that coincide with
the original Bézier polynomial on the full interval $[0, 1]$.


Theorem 8.42 The Bézier polynomials

\[
p_1(x) := \sum_{k=0}^{n} b_0^k(t) B_k^n(x; 0, t) \quad \text{and} \quad p_2(x) := \sum_{k=0}^{n} b_k^{n-k}(t) B_k^n(x; t, 1)
\]

with the coefficients $b_0^k$ and $b_k^{n-k}$ for $k = 0, \ldots, n$ defined by the recursion
(8.57) satisfy

\[
p(x) = p_1(x) = p_2(x), \quad x \in \mathbb{R},
\]

for arbitrary $0 < t < 1$.


Proof. Inserting the equivalent definition (8.56) of the subpolynomials and
reordering the summation, we find that

\[
p_1(x) = \sum_{k=0}^{n} \sum_{j=0}^{k} b_j B_j^k(t) B_k^n(x; 0, t) = \sum_{j=0}^{n} b_j \sum_{k=j}^{n} B_j^k(t) B_k^n(x; 0, t).
\]

Hence the proof will be concluded by showing that

\[
\sum_{k=j}^{n} B_j^k(t) B_k^n(x; 0, t) = B_j^n(x), \quad x \in \mathbb{R}. \tag{8.58}
\]

To establish this identity we make use of Definition 8.36 and obtain with
the aid of the binomial formula that

\[
\begin{aligned}
\sum_{k=j}^{n} B_j^k(t) B_k^n(x; 0, t) &= \sum_{k=j}^{n} \binom{k}{j} (1 - t)^{k-j} t^{j-n} \binom{n}{k} x^k (t - x)^{n-k} \\
&= \binom{n}{j} t^{j-n} \sum_{k=j}^{n} \binom{n-j}{k-j} (1 - t)^{k-j} x^k (t - x)^{n-k} \\
&= \binom{n}{j} x^j (1 - x)^{n-j}.
\end{aligned}
\]

Hence (8.58) is valid, and consequently $p_1 = p$. The proof of $p_2 = p$ is
completely analogous, and it can also be obtained by a symmetry argument
from $p_1 = p$. □


A natural choice in the subdivision of Theorem 8.42 is to break the 
interval in half by taking t = 1/2. Successively repeating the subdivision 
leads to a sequence of Bézier polygons that converges rapidly enough to 
the original Bézier curve to make this subdivision algorithm practical for 
an effective visualization of the curve on a computer. 
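
The subdivision of Theorem 8.42 only needs the de Casteljau tableau. The sketch below is an illustration with hypothetical names (`de_casteljau_tableau`, `evaluate`, `subdivide`) and sample control points: the left piece uses the coefficients $b_0^k(t)$ and the right piece the coefficients $b_k^{n-k}(t)$, each piece being parametrized over its own unit interval.

```python
def de_casteljau_tableau(points, t):
    """Return rows with rows[k][i] = b_i^k(t), built by the recursion (8.57)."""
    rows = [[tuple(p) for p in points]]
    while len(rows[-1]) > 1:
        prev = rows[-1]
        rows.append([tuple((1.0 - t) * p + t * q for p, q in zip(prev[i], prev[i + 1]))
                     for i in range(len(prev) - 1)])
    return rows


def evaluate(points, t):
    """Value p(t) = b_0^n(t) of the Bezier polynomial with the given control points."""
    return de_casteljau_tableau(points, t)[-1][0]


def subdivide(points, t):
    """Control points of the two pieces on [0, t] and [t, 1] (Theorem 8.42)."""
    rows = de_casteljau_tableau(points, t)
    n = len(points) - 1
    left = [rows[k][0] for k in range(n + 1)]         # b_0^k(t), k = 0, ..., n
    right = [rows[n - k][k] for k in range(n + 1)]    # b_k^{n-k}(t), k = 0, ..., n
    return left, right


# Hypothetical planar cubic, split at the midpoint.
control = [(0.0, 0.0), (1.0, 2.0), (3.0, 3.0), (4.0, 0.0)]
left, right = subdivide(control, 0.5)
```

Evaluating `left` at a local parameter `u` reproduces `p(0.5 * u)`, and `right` reproduces `p(0.5 + 0.5 * u)`; applying `subdivide` again to each piece gives the refined polygons used for drawing.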
Problems


8.1 Let $u_1, \ldots, u_n \in C[a, b]$ be linearly independent and let $x_1, \ldots, x_n \in [a, b]$
be distinct. For given values $y_1, \ldots, y_n \in \mathbb{R}$ consider the interpolation problem
of finding a function $u \in U_n := \operatorname{span}\{u_1, \ldots, u_n\}$ with the property

\[
u(x_j) = y_j, \quad j = 1, \ldots, n.
\]

Show that the following three properties are equivalent:

(a) The interpolation problem is uniquely solvable for each given set of values
$y_1, \ldots, y_n \in \mathbb{R}$.

(b) Each function $u \in U_n$ with zeros $u(x_j) = 0$ for $j = 1, \ldots, n$ vanishes identically.

(c) The $n \times n$ matrix with entries $u_k(x_j)$ for $j, k = 1, \ldots, n$ is regular.


8.2 Consider the interpolation of f(x) := 2* by a polynomial p € P3 with the 
four interpolation points —1,0, 1,2. Discuss the behavior of the error p— f in the 
interval [—1, 2]. 


8.3 Write a computer program for the Neville scheme of Theorem 8.9. 


8.4 Show that the interpolation operator $L_n : C[a, b] \to P_n$ given by (8.7) is a
linear operator. Show that it is a bounded operator if both the domain and range
space are equipped with the maximum norm.


8.5 Let $x_0, \ldots, x_n \in \mathbb{R}$ be $n + 1$ distinct points. Show that the Vandermonde
matrix $V$ with entries $(x_j^k)$ for $j, k = 0, 1, \ldots, n$ has determinant

\[
\det V = \prod_{0 \le j < k \le n} (x_k - x_j).
\]
8.6 Verify numerically the findings of Runge described in Example 8.14. 
8.7 Verify the relations (8.12) for the Hermite factors. 


8.8 Prove Theorem 8.19, i.e., the representation of the remainder in Hermite 
interpolation. 


8.9 Given a twice continuously differentiable function $f : [a, b] \to \mathbb{R}$ and three
points $x_0, x_1, x_2 \in [a, b]$ with $x_0 \ne x_2$, show that there exists a unique polynomial
$p \in P_3$ for which

\[
p(x_0) = f(x_0), \quad p'(x_1) = f'(x_1), \quad p''(x_1) = f''(x_1), \quad p(x_2) = f(x_2).
\]


Find a representation of the polynomial and give a representation of the re- 
mainder analogous to Theorem 8.10. (This is an example of Hermite—Birkhoff 
interpolation. ) 


8.10 Inverse interpolation can be used to solve nonlinear equations $f(x) = 0$
approximately by interchanging the roles of interpolation points and interpolation
values. Find an approximation of the zero $x = 1.5$ for $f(x) = (4x + 1)^3 - 343$ from
the values of $f$ at the four points $x = 0, 1, 2, 3$ by inverse cubic interpolation, i.e.,
by interpolating the inverse of $f$ by a cubic polynomial with interpolation points
$f(0), f(1), f(2), f(3)$ and interpolation values $0, 1, 2, 3$. For the computation use
the Neville scheme. Are you satisfied with the accuracy of the result?

8.11 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that the Lagrange factors are given by

\[
\ell_k(t) = F(t - t_k), \quad k = 0, \ldots, 2n,
\]

where

\[
F(t) := \frac{1}{2n + 1} \, \frac{\sin\left(n + \frac{1}{2}\right) t}{\sin \frac{t}{2}}
\]

for $t \ne 0, \pm 2\pi, \pm 4\pi, \ldots$. Prove that


20 
27 


8.12 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that

\[
\|L_n f - f\|_2 \to 0, \quad n \to \infty,
\]

for each continuous $2\pi$-periodic function $f$.
Hint: With the aid of Problem 8.11, show that

\[
\|L_n g\|_2 \le \sqrt{2\pi} \, \|g\|_\infty
\]

for all $n \in \mathbb{N}$ and all continuous $2\pi$-periodic functions $g$ and use the Weierstrass
approximation theorem for periodic functions.


8.13 For the trigonometric interpolation from Theorem 8.24 with $2n + 1$ equidistant
interpolation points show that

\[
\|L_n f - f\|_\infty \to 0, \quad n \to \infty,
\]

for each continuously differentiable $2\pi$-periodic function $f$.
Hint: For the functions $f_k(t) := e^{ikt}$ show that

\[
\|L_n f_k - f_k\|_\infty \le 2
\]

for $n = 1, 2, \ldots$ and $k = 0, \pm 1, \pm 2, \ldots$, and use the fact that the Fourier series
for continuously differentiable functions is uniformly convergent.


8.14 Write a computer program for the fast Fourier transform. 


8.15 Given n distinct points 21,...,2n ¢ [a,b], n distinct points 71,...,2n in 
[a, b], and n values yi,...,yn € IR, show that there exists a unique function of 
the form , 

ak 
u(z) = » Lk + Zk 
k=1 
with real coefficients a1,...,@n such that 


u(z;)=y;, ju=l,...,n. 
8.16 Verify the relations (8.34)–(8.36) for B-splines.


8.17 Use the fact that the second derivative of a cubic spline is a piecewise linear 
function to derive the linear system (8.43) without using the B-spline (8.36). 
Hint: On each subinterval integrate the piecewise linear function for s” twice and 
eliminate the integration constants through the interpolation conditions. Then 
use the continuity of s’ to obtain the linear system. 


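8.18 For the Bernstein polynomials show that

\[
\sum_{k=0}^{n} \frac{k}{n} B_k^n(t) = t, \quad t \in \mathbb{R},
\]

and

\[
\sum_{k=0}^{n} \frac{k^2}{n^2} B_k^n(t) = \frac{n - 1}{n} t^2 + \frac{t}{n}, \quad t \in \mathbb{R}.
\]


8.19 Show that the convex hull

\[
\operatorname{con}\{b_0, \ldots, b_n\} := \left\{ \sum_{k=0}^{n} \alpha_k b_k : \alpha_k \ge 0, \ \sum_{k=0}^{n} \alpha_k = 1 \right\}
\]

of $n + 1$ points $b_0, \ldots, b_n \in \mathbb{R}^m$ is convex and that $\operatorname{con}\{b_0, \ldots, b_n\} \subset U$ for each
convex set $U$ with $b_0, \ldots, b_n \in U$.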


8.20 Give the Bézier representation of the (cubic) Hermite factors of Theorem 
8.18 for the case of two interpolation points. Draw the graphs of the Hermite 
factors and their Bézier polygons. 


9 


Numerical Integration 


Numerical integration formulae, or quadrature formulae, are methods for 
the approximate evaluation of definite integrals. They are needed for the 
computation of those integrals for which either the antiderivative of the in- 
tegrand cannot be expressed in terms of elementary functions or for which 
the integrand is available only at discrete points, for example from exper- 
imental data. In addition and even more important, quadrature formulae 
provide a basic and important tool for the numerical solution of differential 
and integral equations, as we shall see in Chapters 10, 11, and 12. 

The evaluation of planar areas bounded by curves is one of the oldest 
problems in science. Attempts to measure the area bounded by circles, 
ellipses, and parabolas were undertaken already by the Babylonians, Egyp- 
tians, and Greeks. However, a systematic analysis only became possible 
after the invention of calculus. Newton interpolated functions at equidis- 
tant points and integrated the interpolating polynomial and thus invented 
what now is known as the Newton—Cotes quadratures. Describing these in- 
terpolatory quadrature formulae will be the subject of Sections 9.1 and 9.2. 
Gauss was the first to notice that nonequidistant interpolation points lead, 
in general, to better accuracy for the resulting approximations to the inte- 
grals. In 1814 he presented a paper entitled “Methodus nova integralium 
valores per approximationem inveniendi” introducing quadrature formulae 
with the degree of accuracy considerably improved as compared with the 
Newton—Cotes formulae. These Gaussian quadrature formulae will be the 
subject of Section 9.3. The remaining part of this chapter is based on the 
Euler—Maclaurin expansion, which was found and published independently 
by Euler (1738) and Maclaurin (1737). We shall first employ the
Euler–Maclaurin expansion in our analysis of numerical integration of periodic
functions. We will then use it to develop Romberg integration as a typical
example for the use of the extrapolation method in order to increase the
degree of accuracy. And finally, for integrands with endpoint singularities we
will describe quadrature formulae that are based on a mesh that is graded
towards the endpoints, and we will analyze the error with the help of the
Euler–Maclaurin expansion.

For a comprehensive study of numerical integration methods including 
multidimensional integration we refer to [9, 17, 21, 57]. 


9.1 Interpolatory Quadratures 
The most common quadrature formulae approximate the definite integral

\[
Q(f) := \int_a^b f(x) \, dx \tag{9.1}
\]

of a continuous function $f$ over the interval $[a, b]$ with $a < b$ by a weighted
sum

\[
Q_n(f) := \sum_{k=0}^{n} a_k f(x_k) \tag{9.2}
\]

with $n + 1$ distinct quadrature points $x_0, \ldots, x_n \in [a, b]$ and quadrature
weights $a_0, \ldots, a_n \in \mathbb{R}$. As one of the main applications of interpolation
as developed in the previous chapter, an important group of quadrature
formulae is obtained by integrating an interpolating polynomial instead of
the integrand $f$, i.e., by approximating

\[
\int_a^b f(x) \, dx \approx \int_a^b (L_n f)(x) \, dx,
\]

where $L_n : C[a, b] \to P_n$ denotes the polynomial interpolation operator
with interpolation points $x_0, \ldots, x_n$ introduced in Section 8.1 (see (8.7)).


Note that both the integral $Q$ and the quadrature formula $Q_n$ represent
linear operators from $C[a, b]$ into $\mathbb{R}$.


Theorem 9.1 The polynomial interpolatory quadrature of order $n$ defined
by

\[
Q_n(f) := \int_a^b (L_n f)(x) \, dx \tag{9.3}
\]

is of the form (9.2) with the weights given by

\[
a_k = \frac{1}{q_{n+1}'(x_k)} \int_a^b \frac{q_{n+1}(x)}{x - x_k} \, dx, \quad k = 0, \ldots, n, \tag{9.4}
\]

where $q_{n+1}(x) := (x - x_0) \cdots (x - x_n)$.

Proof. From (8.2) we obtain

\[
\int_a^b (L_n f)(x) \, dx = \sum_{k=0}^{n} f(x_k) \int_a^b \ell_k(x) \, dx
\]

with

\[
a_k = \int_a^b \ell_k(x) \, dx = \int_a^b \prod_{\substack{j=0 \\ j \ne k}}^{n} \frac{x - x_j}{x_k - x_j} \, dx,
\]

whence (9.4) follows by rewriting the product. □


The following theorem describes an equivalent definition of polynomial 
interpolatory quadratures. 


Theorem 9.2 Given $n + 1$ distinct quadrature points $x_0, \ldots, x_n \in [a, b]$,
the interpolatory quadrature (9.3) of order $n$ is uniquely determined by its
property of integrating all polynomials $p \in P_n$ exactly, i.e., by the property

\[
\sum_{k=0}^{n} a_k p(x_k) = \int_a^b p(x) \, dx \tag{9.5}
\]

for all $p \in P_n$.

Proof. From (9.3) and $L_n p = p$ for all $p \in P_n$ it follows that

\[
\sum_{k=0}^{n} a_k p(x_k) = \int_a^b (L_n p)(x) \, dx = \int_a^b p(x) \, dx;
\]

i.e., the quadrature is exact for all $p \in P_n$. On the other hand, from (9.5)
we obtain

\[
\sum_{k=0}^{n} a_k f(x_k) = \sum_{k=0}^{n} a_k (L_n f)(x_k) = \int_a^b (L_n f)(x) \, dx
\]

for all $f \in C[a, b]$; i.e., the quadrature is an interpolatory quadrature. □


Theorem 9.3 The polynomial interpolatory quadrature of order $n$ with
equidistant quadrature points

\[
x_k = a + kh, \quad k = 0, \ldots, n,
\]

and step width $h = (b - a)/n$ is called the Newton–Cotes quadrature formula
of order $n$. Its weights are given by

\[
a_k = h \, \frac{(-1)^{n-k}}{k! \, (n - k)!} \int_0^n \prod_{\substack{j=0 \\ j \ne k}}^{n} (z - j) \, dz, \quad k = 0, \ldots, n, \tag{9.6}
\]

and have the symmetry property $a_k = a_{n-k}$, $k = 0, \ldots, n$.

Proof. The weights are obtained from (9.4) by substituting $x = x_0 + hz$
and observing that

\[
q_{n+1}(x) = h^{n+1} \prod_{j=0}^{n} (z - j)
\]

and

\[
q_{n+1}'(x_k) = (-1)^{n-k} k! \, (n - k)! \, h^n.
\]

The symmetry $a_k = a_{n-k}$ follows by substituting $z = n - y$. □
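
Formula (9.6) can be evaluated in exact rational arithmetic. The following sketch, with the hypothetical name `newton_cotes_weights`, expands the product $\prod_{j \ne k} (z - j)$ into monomial coefficients, integrates term by term over $[0, n]$, and returns the weights with the common factor $h$ omitted.

```python
from fractions import Fraction
from math import factorial


def newton_cotes_weights(n):
    """Weights a_k / h of the Newton-Cotes formula of order n, from (9.6)."""
    weights = []
    for k in range(n + 1):
        coeffs = [Fraction(1)]                  # polynomial in z, lowest degree first
        for j in range(n + 1):
            if j == k:
                continue
            shifted = [Fraction(0)] + coeffs    # coefficients of z * poly
            for i, c in enumerate(coeffs):
                shifted[i] -= j * c             # subtract j * poly, giving (z - j) * poly
            coeffs = shifted
        # Integrate the expanded polynomial over [0, n] term by term.
        integral = sum(c * Fraction(n) ** (i + 1) / (i + 1) for i, c in enumerate(coeffs))
        sign = -1 if (n - k) % 2 else 1
        weights.append(sign * integral / (factorial(k) * factorial(n - k)))
    return weights


simpson_weights = newton_cotes_weights(2)
has_negative_weight_n8 = min(newton_cotes_weights(8)) < 0
```

Since the formula integrates constants exactly, the returned weights must sum to $n$, which is a convenient sanity check for any order.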


These quadrature formulae were first discovered by Newton and also
carry the name of Cotes because of his systematic account of Newton's
integration rules in 1711. The Newton–Cotes quadrature formula of order
$n = 1$ is known as the trapezoidal rule. Its weights can be obtained either
from evaluating (9.6) or more easily from the exactness conditions of
Theorem 9.2. For the interval $[-1, 1]$, these conditions are given by

\[
a_0 + a_1 = \int_{-1}^{1} dx = 2, \qquad -a_0 + a_1 = \int_{-1}^{1} x \, dx = 0,
\]

and imply that $a_0 = a_1 = 1$. Hence, for a general interval the trapezoidal
rule has the form

\[
\int_a^b f(x) \, dx \approx \frac{b - a}{2} [f(a) + f(b)] = \frac{h}{2} [f(x_0) + f(x_1)].
\]

Geometrically speaking, the trapezoidal rule approximates the integral of
$f$ by the integral of the straight line connecting the two points $(a, f(a))$
and $(b, f(b))$. Hence, the approximate value coincides with the area of the
trapezoid with the four corners $(a, 0)$, $(b, 0)$, $(a, f(a))$, and $(b, f(b))$.

The Newton–Cotes quadrature formula of order $n = 2$ was already known
to Kepler in 1612 and Cavalieri in 1639 and is called Simpson's rule, since
Simpson rediscovered it in 1743. Its weights are obtained from the exactness
conditions

\[
a_0 + a_1 + a_2 = \int_{-1}^{1} dx = 2, \qquad -a_0 + a_2 = \int_{-1}^{1} x \, dx = 0, \qquad a_0 + a_2 = \int_{-1}^{1} x^2 \, dx = \frac{2}{3},
\]

which imply that $a_0 = a_2 = 1/3$ and $a_1 = 4/3$. Hence, for a general interval,
Simpson's rule is given by

\[
\int_a^b f(x) \, dx \approx \frac{b - a}{6} \left[ f(a) + 4 f\left(\frac{a + b}{2}\right) + f(b) \right] = \frac{h}{3} [f(x_0) + 4 f(x_1) + f(x_2)].
\]

Geometrically speaking, Simpson's rule approximates the integral of $f$ by
the integral of the parabola through the three points $(a, f(a))$,
$\left(\frac{a+b}{2}, f\left(\frac{a+b}{2}\right)\right)$, and $(b, f(b))$.
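
Both rules are one-line formulas in code. The sketch below (function names chosen here for illustration) implements them on a general interval and applies them to integrands for which, by the exactness conditions above, they must reproduce the integral.

```python
def trapezoidal(f, a, b):
    """Trapezoidal rule on [a, b]."""
    return (b - a) / 2.0 * (f(a) + f(b))


def simpson(f, a, b):
    """Simpson's rule on [a, b]."""
    return (b - a) / 6.0 * (f(a) + 4.0 * f((a + b) / 2.0) + f(b))


# The trapezoidal rule is exact for polynomials of degree at most one:
# the integral of 3x + 1 over [0, 2] is 8.
trap_linear = trapezoidal(lambda x: 3.0 * x + 1.0, 0.0, 2.0)

# Simpson's rule is exact for polynomials of degree at most two:
# the integral of x^2 over [0, 3] is 9.
simpson_quadratic = simpson(lambda x: x * x, 0.0, 3.0)
```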
Table 9.1 gives the weights of the first four Newton–Cotes formulae (with
the common factor $h = (b - a)/n$ omitted).


TABLE 9.1. Weights of Newton–Cotes formulae

n   Weights                              Name
1   1/2    1/2                           Trapezoidal rule
2   1/3    4/3    1/3                    Simpson's rule
3   3/8    9/8    9/8    3/8             Newton's three-eighths rule
4   14/45  64/45  24/45  64/45  14/45    Milne's rule

For n > 8 some of the weights of the Newton—Cotes formulae become 
negative (see Problem 9.4). Since this might lead to negative approxima- 
tions for integrals with positive integrands, the higher-order Newton—Cotes 
rules cannot be recommended for numerical purposes. 

We will carry out the error analysis for the Newton—Cotes formulae only 
for the two most important cases, n = 1 and n = 2, i.e., the trapezoidal 
rule and Simpson’s rule. 


Theorem 9.4 Let $f : [a,b] \to \mathbb{R}$ be twice continuously differentiable.
Then the error for the trapezoidal rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b-a}{2}\,[f(a) + f(b)] = -\frac{h^3}{12}\,f''(\xi) \tag{9.7}$$

with some $\xi \in [a,b]$ and $h = b - a$.


Proof. Let $L_1 f$ denote the linear interpolation of $f$ at the interpolation
points $x_0 = a$ and $x_1 = b$. By construction of the trapezoidal rule we have
that the error

$$E_1(f) := \int_a^b f(x)\,dx - \frac{b-a}{2}\,[f(a) + f(b)]$$

is given by

$$E_1(f) = \int_a^b [f(x) - (L_1 f)(x)]\,dx = \int_a^b (x-a)(x-b)\,\frac{f(x) - (L_1 f)(x)}{(x-a)(x-b)}\,dx.$$

Since the first factor of the integrand is nonpositive on $[a,b]$ and since
by l’Hôpital’s rule the second factor is continuous, from the mean value
theorem for integrals we obtain that

$$E_1(f) = \frac{f(z) - (L_1 f)(z)}{(z-a)(z-b)} \int_a^b (x-a)(x-b)\,dx$$

for some $z \in [a,b]$. From this, with the aid of the error representation for
linear interpolation from Theorem 8.10 and the integral

$$\int_a^b (x-a)(x-b)\,dx = -\frac{h^3}{6},$$

the assertion of the theorem follows. □


We explicitly note that (9.7) cannot be obtained by integrating the
interpolation error representation (8.8), since we do not know whether the
intermediate point $\xi$ in (8.8) depends continuously on $x$.

By construction, Simpson’s rule integrates polynomials of degree less 
than or equal to two exactly. In addition, it also integrates polynomials of 
degree three exactly. By linearity, to show this it suffices to prove it for one 
polynomial of degree three. For the polynomial 


$$q_3(x) = (x - x_0)(x - x_1)(x - x_2)$$


both the integral and the value obtained from Simpson’s rule are zero. 
Hence, this polynomial of degree three is integrated exactly by Simpson’s 
rule. 


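Theorem 9.5 Let $f : [a,b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for Simpson’s rule can be represented in the form

$$\int_a^b f(x)\,dx - \frac{b-a}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right] = -\frac{h^5}{90}\,f^{(4)}(\xi) \tag{9.8}$$

for some $\xi \in [a,b]$ and $h = (b-a)/2$.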
Proof. Let $L_2 f$ denote the quadratic interpolation polynomial for $f$ at the
interpolation points $x_0 = a$, $x_1 = (a+b)/2$, and $x_2 = b$. By construction
of Simpson’s rule we have that the error

$$E_2(f) := \int_a^b f(x)\,dx - \frac{b-a}{6}\left[f(a) + 4f\!\left(\frac{a+b}{2}\right) + f(b)\right]$$

is given by

$$E_2(f) = \int_a^b [f(x) - (L_2 f)(x)]\,dx. \tag{9.9}$$

Consider the cubic polynomial

$$p(x) := (L_2 f)(x) + \frac{4}{(b-a)^2}\,[(L_2 f)'(x_1) - f'(x_1)]\,q_3(x), \tag{9.10}$$

where $q_3(x) = (x - x_0)(x - x_1)(x - x_2)$. Obviously, $p$ has the interpolation
properties

$$p(x_k) = f(x_k), \quad k = 0,1,2, \quad\text{and}\quad p'(x_1) = f'(x_1).$$

Since $\int_a^b q_3(x)\,dx = 0$, from (9.9) and (9.10) we can conclude that

$$E_2(f) = \int_a^b [f(x) - p(x)]\,dx,$$

and consequently

$$E_2(f) = \int_a^b (x - x_0)(x - x_1)^2(x - x_2)\,\frac{f(x) - p(x)}{(x - x_0)(x - x_1)^2(x - x_2)}\,dx.$$

As in the proof of Theorem 9.4, the first factor of the integrand is
nonpositive on $[a,b]$, and the second factor is continuous. Hence, by the mean
value theorem for integrals, we obtain that

$$E_2(f) = \frac{f(z) - p(z)}{(z - x_0)(z - x_1)^2(z - x_2)} \int_a^b (x - x_0)(x - x_1)^2(x - x_2)\,dx$$

for some $z \in [a,b]$. Analogous to Theorem 8.10, it can be shown that

$$f(z) - p(z) = \frac{f^{(4)}(\xi)}{4!}\,(z - x_0)(z - x_1)^2(z - x_2)$$

for some $\xi \in [a,b]$. From this, with the aid of the integral

$$\int_a^b (x - x_0)(x - x_1)^2(x - x_2)\,dx = -\frac{(b-a)^5}{120},$$

we conclude the statement of the theorem. □
Example 9.6 The approximation of $\ln 2 = \int_0^1 \frac{dx}{1+x}$ by the
trapezoidal rule yields

$$\ln 2 \approx \frac{1}{2}\left[1 + \frac{1}{2}\right] = 0.75.$$

For $f(x) := 1/(1+x)$ we have

$$\frac{h^3}{12}\,\|f''\|_\infty = \frac{1}{6},$$

and hence, from Theorem 9.4, we obtain the estimate $|\ln 2 - 0.75| \le 0.167$
as compared to the true error $\ln 2 - 0.75 = -0.056\ldots$.
Simpson’s rule yields

$$\ln 2 \approx \frac{1}{6}\left[1 + \frac{8}{3} + \frac{1}{2}\right] = \frac{25}{36} = 0.6944\ldots,$$

and from Theorem 9.5 and

$$\frac{h^5}{90}\,\|f^{(4)}\|_\infty = \frac{1}{120}$$

we find the estimate $|\ln 2 - 0.6944| \le 0.0084$ as compared to the true error
$\ln 2 - 25/36 = -0.0012\ldots$. □


In order to increase the accuracy, instead of using higher order Newton— 
Cotes rules it is more practical to use so-called composite rules. These 
are obtained by subdividing the interval of integration and then applying a 
fixed rule with low interpolation order to each of the subintervals. The most 
frequently used quadrature rules of this type are the composite trapezoidal 
rule and the composite Simpson’s rule. 

Let $x_k = a + kh$, $k = 0,\ldots,n$, be an equidistant subdivision with step
size $h = (b-a)/n$. Then the composite trapezoidal rule is given by

$$T_h(f) := h\left[\frac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \frac{1}{2} f(x_n)\right]$$

for $f \in C[a,b]$.


Theorem 9.7 Let $f : [a,b] \to \mathbb{R}$ be twice continuously differentiable. Then
the error for the composite trapezoidal rule is given by

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{b-a}{12}\,h^2 f''(\xi)$$

for some $\xi \in [a,b]$.

Proof. By Theorem 9.4 we have that

$$\int_a^b f(x)\,dx - T_h(f) = -\frac{h^3}{12} \sum_{k=1}^{n} f''(\xi_k),$$

where $a \le \xi_1 \le \xi_2 \le \cdots \le \xi_n \le b$. From

$$n \min_{x \in [a,b]} f''(x) \le \sum_{k=1}^{n} f''(\xi_k) \le n \max_{x \in [a,b]} f''(x)$$

and the continuity of $f''$ we conclude that there exists $\xi \in [a,b]$ such that

$$\sum_{k=1}^{n} f''(\xi_k) = n f''(\xi),$$

and, since $nh = b - a$, the proof is finished. □


Let $n$ be even. Then the composite Simpson’s rule is given by

$$S_h(f) := \frac{h}{3}\,[f(x_0) + 4f(x_1) + 2f(x_2) + 4f(x_3) + 2f(x_4) + \cdots + 2f(x_{n-2}) + 4f(x_{n-1}) + f(x_n)]$$

for $f \in C[a,b]$. Its error can be represented and estimated as follows.


Theorem 9.8 Let $f : [a,b] \to \mathbb{R}$ be four-times continuously differentiable.
Then the error for the composite Simpson’s rule is given by

$$\int_a^b f(x)\,dx - S_h(f) = -\frac{b-a}{180}\,h^4 f^{(4)}(\xi)$$

for some $\xi \in [a,b]$.

Proof. Using Theorem 9.5, the proof is analogous to the proof of Theorem
9.7. □


Table 9.2 gives the error between the exact value of the integral from Ex- 
ample 9.6 and its numerical approximation by the composite trapezoidal 
rule and the composite Simpson’s rule. Clearly, if the number n of quadra- 
ture points is doubled, i.e., if the step size h is halved, then the error for 
the trapezoidal rule is reduced by the factor 1/4 and for Simpson’s rule by 
the factor 1/16, as predicted in Theorems 9.7 and 9.8. 
TABLE 9.2. Trapezoidal and Simpson’s rule for Example 9.6

n     Trapezoidal rule   Simpson’s rule
1     -0.05685282
2     -0.01518615        -0.00129726
4     -0.00387663        -0.00010679
8     -0.00097467        -0.00000735
16    -0.00024402        -0.00000047
32    -0.00006103        -0.00000003
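
The composite rules and the error behavior reported in Table 9.2 can be reproduced with a few lines of code. The following is a minimal sketch; the function names `composite_trapezoid` and `composite_simpson` are illustrative choices, not notation from the text, and the integrand is the one from Example 9.6.

```python
import math


def composite_trapezoid(f, a, b, n):
    """Composite trapezoidal rule T_h(f) with n subintervals, h = (b - a) / n."""
    h = (b - a) / n
    interior = sum(f(a + k * h) for k in range(1, n))
    return h * (0.5 * f(a) + interior + 0.5 * f(b))


def composite_simpson(f, a, b, n):
    """Composite Simpson's rule S_h(f); n must be even."""
    if n % 2 != 0:
        raise ValueError("n must be even for the composite Simpson's rule")
    h = (b - a) / n
    odd = sum(f(a + k * h) for k in range(1, n, 2))   # weights 4
    even = sum(f(a + k * h) for k in range(2, n, 2))  # weights 2
    return h / 3 * (f(a) + 4 * odd + 2 * even + f(b))


def integrand(x):
    """Integrand of Example 9.6, whose integral over [0, 1] is ln 2."""
    return 1.0 / (1.0 + x)


if __name__ == "__main__":
    exact = math.log(2.0)
    for n in (2, 4, 8, 16, 32):
        err_t = exact - composite_trapezoid(integrand, 0.0, 1.0, n)
        err_s = exact - composite_simpson(integrand, 0.0, 1.0, n)
        print(f"n={n:2d}  trapezoid error={err_t: .8f}  Simpson error={err_s: .8f}")
```

Halving h should reduce the printed trapezoidal errors by roughly a factor of 4 and the Simpson errors by roughly a factor of 16, in line with Theorems 9.7 and 9.8.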


9.2 Convergence of Quadrature Formulae 


Definition 9.9 A sequence $(Q_n)$ of quadrature formulae is called convergent if

$$Q_n(f) \to Q(f) := \int_a^b f(x)\,dx, \quad n \to \infty,$$

for all $f \in C[a,b]$.

Theorem 9.10 (Szegő) Let

$$Q_n(f) = \sum_{k=0}^{n} a_k^{(n)} f(x_k^{(n)})$$

be a sequence of quadrature formulae that converges for all polynomials, i.e.,

$$\lim_{n \to \infty} Q_n(p) = Q(p) \tag{9.11}$$

for all polynomials $p$, and that is uniformly bounded, i.e., there exists a
constant $C > 0$ such that

$$\sum_{k=0}^{n} |a_k^{(n)}| \le C \tag{9.12}$$

for all $n \in \mathbb{N}$. Then the sequence $(Q_n)$ is convergent.


Proof. Let $f \in C[a,b]$ and $\varepsilon > 0$ be arbitrary. By the Weierstrass
approximation theorem (see [16]) there exists a polynomial $p$ such that

$$\|f - p\|_\infty \le \frac{\varepsilon}{2(C + b - a)}.$$

Then, since by (9.11) we have $Q_n(p) \to Q(p)$ as $n \to \infty$, there exists
$N(\varepsilon) \in \mathbb{N}$ such that

$$|Q_n(p) - Q(p)| \le \frac{\varepsilon}{2}$$

for all $n \ge N(\varepsilon)$. Now with the aid of the triangle inequality and using
(9.12) we can estimate

$$|Q_n(f) - Q(f)| \le \sum_{k=0}^{n} |a_k^{(n)}|\,|f(x_k^{(n)}) - p(x_k^{(n)})| + |Q_n(p) - Q(p)| + \int_a^b |p(x) - f(x)|\,dx$$

$$\le \frac{C\varepsilon}{2(C + b - a)} + \frac{\varepsilon}{2} + \frac{(b-a)\varepsilon}{2(C + b - a)} = \varepsilon$$

for all $n \ge N(\varepsilon)$; i.e., $Q_n(f) \to Q(f)$ for $n \to \infty$. □


A quadrature formula

$$Q_n(f) = \sum_{k=0}^{n} a_k f(x_k)$$

defines a bounded linear operator $Q_n : C[a,b] \to \mathbb{R}$ with the norm given
by

$$\|Q_n\|_\infty = \sum_{k=0}^{n} |a_k|. \tag{9.13}$$

To prove this, we note the estimate

$$|Q_n f| \le \|f\|_\infty \sum_{k=0}^{n} |a_k|,$$

which implies that $Q_n$ is a bounded operator and that the operator norm
is less than or equal to the right-hand side of (9.13). Equality in (9.13)
follows by choosing $f$ to be a continuous piecewise linear function with
$\|f\|_\infty = 1$ and $f(x_k)a_k = |a_k|$ for $k = 0,\ldots,n$. From (9.13) and the
uniform boundedness principle, Theorem 12.7, it can be seen that the two
conditions of Theorem 9.10 are also necessary for convergence of a sequence
of quadrature formulae.


Corollary 9.11 (Steklov) Assume that the sequence $(Q_n)$ of quadrature
formulae converges for all polynomials and that all the weights are
nonnegative. Then the sequence $(Q_n)$ is convergent.

Proof. This follows from

$$\sum_{k=0}^{n} |a_k^{(n)}| = \sum_{k=0}^{n} a_k^{(n)} = Q_n(1) \to \int_a^b dx = b - a, \quad n \to \infty,$$

and the preceding Theorem 9.10. □
From Theorems 9.7 and 9.8 and Corollary 9.11 we observe that the
composite trapezoidal rule and the composite Simpson’s rule are convergent.
On the other hand, using the fact that the conditions of Theorem 9.10 are
necessary for convergence, it can be shown that the Newton–Cotes
quadratures do not converge for all continuous functions (see Problem 9.5).


9.3 Gaussian Quadrature Formulae 


Given the arbitrary quadrature points $x_0,\ldots,x_n$ in $[a,b]$, the quadrature
weights $a_0,\ldots,a_n$ of a polynomial interpolatory quadrature are determined
such that all polynomials of degree less than or equal to $n$ are integrated
exactly. In this section we will examine the problem of whether the
quadrature points can be chosen in such a way that polynomials of degree less
than or equal to $2n + 1$ are also integrated exactly. Obviously, to achieve this
degree of exactness the quadrature points and the quadrature weights have
to satisfy the conditions

$$\sum_{k=0}^{n} a_k x_k^i = \int_a^b x^i\,dx, \quad i = 0,\ldots,2n+1.$$

We shall see that this system of $2n + 2$ nonlinear equations for the $2n + 2$
unknowns $x_0,\ldots,x_n \in [a,b]$ and $a_0,\ldots,a_n \in \mathbb{R}$ has a unique solution and
that for this solution the points $x_0,\ldots,x_n$ are distinct.


We shall proceed slightly more generally by considering quadrature
formulae for the integral

$$Q(f) := \int_a^b w(x) f(x)\,dx, \tag{9.14}$$

where $w$ denotes some weight function. We assume that $w : (a,b) \to \mathbb{R}$
is continuous and positive and that the integral $\int_a^b w(x)\,dx$ exists. Typical
examples are given by

$$w(x) = 1, \quad w(x) = \sqrt{1 - x^2}, \quad w(x) = \frac{1}{\sqrt{1 - x^2}},$$

where for the two latter cases the interval is assumed to be $[a,b] = [-1,1]$.
Analogously to the case $w(x) = 1$, interpolatory quadrature rules for (9.14)
are obtained by replacing $f$ by its interpolation polynomial $L_n f$ and
then integrating exactly, i.e., by approximating $Q(f)$ by

$$Q_n(f) = \int_a^b w(x)(L_n f)(x)\,dx.$$

Note that the separation of a weight function $w$ for interpolatory
quadrature formulae has the advantage that in general, $wL_n f$ is a better
approximation to $wf$ than $L_n(wf)$ due to possible singularities of $w$ and its
derivatives at the endpoints of the interval.
Definition 9.12 A quadrature formula

$$\int_a^b w(x) f(x)\,dx \approx \sum_{k=0}^{n} a_k f(x_k)$$

with $n+1$ distinct quadrature points is called a Gaussian quadrature formula
if it integrates all polynomials $p \in P_{2n+1}$ exactly, i.e., if

$$\sum_{k=0}^{n} a_k p(x_k) = \int_a^b w(x) p(x)\,dx \tag{9.15}$$

for all $p \in P_{2n+1}$.

Lemma 9.13 Let $x_0,\ldots,x_n$ be the $n+1$ distinct quadrature points of a
Gaussian quadrature formula. Then

$$\int_a^b w(x) q_{n+1}(x) q(x)\,dx = 0 \tag{9.16}$$

for $q_{n+1}(x) := (x - x_0)\cdots(x - x_n)$ and all $q \in P_n$.

Proof. Since $q_{n+1}q \in P_{2n+1}$ and $q_{n+1}(x_k) = 0$, we have that

$$\int_a^b w(x) q_{n+1}(x) q(x)\,dx = \sum_{k=0}^{n} a_k q_{n+1}(x_k) q(x_k) = 0$$

for all $q \in P_n$. □


Lemma 9.14 Let $x_0,\ldots,x_n$ be $n+1$ distinct points satisfying the condition
(9.16). Then the corresponding polynomial interpolatory quadrature is a
Gaussian quadrature formula.

Proof. Let $L_n$ denote the polynomial interpolation operator for the
interpolation points $x_0,\ldots,x_n$. By construction, for the interpolatory quadrature
we have

$$\sum_{k=0}^{n} a_k f(x_k) = \int_a^b w(x)(L_n f)(x)\,dx \tag{9.17}$$

for all $f \in C[a,b]$. Each $p \in P_{2n+1}$ can be represented in the form

$$p = L_n p + q_{n+1} q$$

for some $q \in P_n$, since the polynomial $p - L_n p$ vanishes at the points
$x_0,\ldots,x_n$. Then from (9.16) and (9.17) we obtain that

$$\int_a^b w(x) p(x)\,dx = \int_a^b w(x)(L_n p)(x)\,dx = \sum_{k=0}^{n} a_k p(x_k)$$

for all $p \in P_{2n+1}$. □
Lemma 9.15 There exists a unique sequence $(q_n)$ of polynomials of the
form $q_0 = 1$ and

$$q_n(x) = x^n + r_{n-1}(x), \quad n = 1,2,\ldots,$$

with $r_{n-1} \in P_{n-1}$ satisfying the orthogonality relation

$$\int_a^b w(x) q_n(x) q_m(x)\,dx = 0, \quad n \ne m, \tag{9.18}$$

and

$$P_n = \operatorname{span}\{q_0,\ldots,q_n\}, \quad n = 0,1,\ldots. \tag{9.19}$$

Proof. This follows by the Gram–Schmidt orthogonalization procedure from
Theorem 3.18 applied to the linearly independent functions $u_n(x) := x^n$
for $n = 0,1,\ldots$ and the scalar product

$$(f,g) := \int_a^b w(x) f(x) g(x)\,dx$$

for $f, g \in C[a,b]$. The positive definiteness of the scalar product is a
consequence of $w$ being positive in $(a,b)$. □


Lemma 9.16 Each of the orthogonal polynomials $q_n$ from Lemma 9.15
has $n$ simple zeros in $(a,b)$.

Proof. For $m = 0$, from (9.18) we have that

$$\int_a^b w(x) q_n(x)\,dx = 0$$

for $n > 0$. Hence, since $w$ is positive on $(a,b)$, the polynomial $q_n$ must
have at least one zero in $(a,b)$ where the sign of $q_n$ changes. Denote by
$x_1,\ldots,x_m$ the zeros of $q_n$ in $(a,b)$ where $q_n$ changes its sign. We assume
that $m < n$ and set $r_m(x) := (x - x_1)\cdots(x - x_m)$. Then $r_m \in P_{n-1}$ and
therefore

$$\int_a^b w(x) r_m(x) q_n(x)\,dx = 0.$$

However, this integral must be different from zero, since $r_m q_n$ does not
change its sign on $(a,b)$ and does not vanish identically. Hence, we have
arrived at a contradiction, and consequently $m = n$. □


Theorem 9.17 For each $n = 0,1,\ldots$ there exists a unique Gaussian
quadrature formula of order $n$. Its quadrature points are given by the zeros of
the orthogonal polynomial $q_{n+1}$ of degree $n + 1$.

Proof. This is a consequence of Lemmas 9.13–9.16. □


Theorem 9.18 The weights of the Gaussian quadrature formulae are all
positive.

Proof. Define

$$f_k(x) := \left[\frac{q_{n+1}(x)}{x - x_k}\right]^2, \quad k = 0,\ldots,n.$$

Then

$$a_k\,[q_{n+1}'(x_k)]^2 = \sum_{j=0}^{n} a_j f_k(x_j) = \int_a^b w(x) f_k(x)\,dx > 0,$$

since $f_k \in P_{2n}$, and the theorem is proven. □


Corollary 9.19 The sequence of Gaussian quadrature formulae is conver- 
gent. 


Proof. For each polynomial $p$ we have

$$Q_n(p) = \int_a^b w(x) p(x)\,dx,$$

provided that $2n + 1$ is greater than or equal to the degree of $p$. From their
proofs it is obvious that Theorem 9.10 and its Corollary 9.11 remain valid
for the integral with the weight function $w$. Hence, the statement of the
corollary follows from Theorem 9.18. □


Theorem 9.20 Let $f \in C^{2n+2}[a,b]$. Then the error for the Gaussian
quadrature formula of order $n$ is given by

$$\int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k) = \frac{f^{(2n+2)}(\xi)}{(2n+2)!} \int_a^b w(x)\,[q_{n+1}(x)]^2\,dx$$

for some $\xi \in [a,b]$.

Proof. Recall the Hermite interpolation polynomial $H_n f \in P_{2n+1}$ for $f$
from Theorem 8.18. Since $(H_n f)(x_k) = f(x_k)$, $k = 0,\ldots,n$, for the error

$$E_n(f) := \int_a^b w(x) f(x)\,dx - \sum_{k=0}^{n} a_k f(x_k)$$

we can write

$$E_n(f) = \int_a^b w(x)\,[f(x) - (H_n f)(x)]\,dx.$$

Then as in the proofs of Theorems 9.4 and 9.5, using the mean value
theorem for integrals we obtain

$$E_n(f) = \frac{f(z) - (H_n f)(z)}{[q_{n+1}(z)]^2} \int_a^b w(x)\,[q_{n+1}(x)]^2\,dx$$

for some $z \in [a,b]$. Now the proof is finished with the aid of the error
representation for Hermite interpolation from Theorem 8.19. □
Example 9.21 We consider the Gaussian quadrature formulae for the
weight function

$$w(x) = \frac{1}{\sqrt{1 - x^2}}, \quad x \in [-1,1].$$

The Chebyshev polynomial $T_n$ of degree $n$ is defined by

$$T_n(x) := \cos(n \arccos x), \quad -1 \le x \le 1.$$

Obviously $T_0(x) = 1$ and $T_1(x) = x$. From the addition theorem for the
cosine function, $\cos(n+1)t + \cos(n-1)t = 2\cos t \cos nt$, we can deduce the
recursion formula

$$T_{n+1}(x) + T_{n-1}(x) = 2x\,T_n(x), \quad n = 1,2,\ldots.$$

Hence we have that $T_n \in P_n$ with leading term

$$T_n(x) = 2^{n-1}x^n + \cdots, \quad n = 1,2,\ldots.$$

Substituting $x = \cos t$ we find that

$$\int_{-1}^{1} \frac{T_n(x)T_m(x)}{\sqrt{1 - x^2}}\,dx = \int_0^{\pi} \cos nt \cos mt\,dt = \begin{cases} \pi, & n = m = 0, \\ \pi/2, & n = m > 0, \\ 0, & n \ne m. \end{cases}$$

Hence, the orthogonal polynomials $q_n$ of Lemma 9.15 are given by
$q_n = 2^{1-n}T_n$. The zeros of $T_n$ and hence the quadrature points are given
by

$$x_k = \cos\left(\frac{2k+1}{2n}\,\pi\right), \quad k = 0,\ldots,n-1.$$

The weights can be most easily derived from the exactness conditions

$$\sum_{k=0}^{n-1} a_k T_m(x_k) = \int_{-1}^{1} \frac{T_m(x)}{\sqrt{1 - x^2}}\,dx, \quad m = 0,\ldots,n-1,$$

for the interpolation quadrature, i.e., from

$$\sum_{k=0}^{n-1} a_k \cos\frac{(2k+1)m\pi}{2n} = \begin{cases} \pi, & m = 0, \\ 0, & m = 1,\ldots,n-1. \end{cases}$$

From our analysis of trigonometric interpolation, i.e., from (8.19), we see
that the unique solution of this linear system is given by

$$a_k = \frac{\pi}{n}, \quad k = 0,\ldots,n-1.$$

Hence, for $n = 1,2,\ldots$ the Gauss–Chebyshev quadrature of order $n - 1$ is
given by

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx \approx \frac{\pi}{n} \sum_{k=0}^{n-1} f\left(\cos\frac{2k+1}{2n}\,\pi\right).$$

From Theorem 9.20 we have the error representation

$$\int_{-1}^{1} \frac{f(x)}{\sqrt{1 - x^2}}\,dx - \frac{\pi}{n} \sum_{k=0}^{n-1} f\left(\cos\frac{2k+1}{2n}\,\pi\right) = \frac{\pi\,f^{(2n)}(\xi)}{2^{2n-1}(2n)!}$$

for some $\xi \in [-1,1]$. □
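
As a small illustration of Example 9.21, the Gauss–Chebyshev rule can be sketched in a few lines. The function name `gauss_chebyshev` is a hypothetical choice; it takes the smooth factor f and the number n of quadrature points, so that the rule has order n - 1 in the terminology of the text.

```python
import math


def gauss_chebyshev(f, n):
    """Approximate the integral of f(x) / sqrt(1 - x^2) over [-1, 1].

    Uses the n zeros x_k = cos((2k + 1) * pi / (2n)) of the Chebyshev
    polynomial T_n as quadrature points and the equal weights pi / n.
    """
    nodes = (math.cos((2 * k + 1) * math.pi / (2 * n)) for k in range(n))
    return math.pi / n * sum(f(x) for x in nodes)


if __name__ == "__main__":
    # With n = 3 points, polynomials up to degree 5 are integrated exactly.
    print(gauss_chebyshev(lambda x: x ** 4, 3), 3 * math.pi / 8)
    # For x^6 the rule is no longer exact; the error representation gives
    # pi * 6! / (2^5 * 6!) = pi / 32.
    print(5 * math.pi / 16 - gauss_chebyshev(lambda x: x ** 6, 3), math.pi / 32)
```

The second line of output compares the actual error for f(x) = x^6 with the value predicted by the error representation, which is exact here because the sixth derivative is constant.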
Example 9.22 We now consider the weight function

$$w(x) = 1, \quad x \in [-1,1].$$

The Legendre polynomial $L_n$ of degree $n$ is defined by

$$L_n(x) := \frac{1}{2^n n!}\,\frac{d^n}{dx^n}(x^2 - 1)^n.$$

Obviously, $L_n \in P_n$. If $m < n$, by repeated partial integration we see that

$$\int_{-1}^{1} x^m\,\frac{d^n}{dx^n}(x^2 - 1)^n\,dx = 0,$$

since $(x^2 - 1)^n$ has zeros of order $n$ at the endpoints $-1$ and $1$. Therefore,

$$\int_{-1}^{1} L_n(x)L_m(x)\,dx = 0, \quad n \ne m.$$

The zeros of the Legendre polynomials, and therefore the quadrature
points and weights of the corresponding Gauss–Legendre quadratures,
cannot be given explicitly by a simple expression. We consider only the cases
$n = 1$ and $n = 2$ and note that

$$q_0(x) = 1, \quad q_1(x) = x, \quad q_2(x) = x^2 - \frac{1}{3},$$

where the coefficient of $q_2$ can be determined from $\int_{-1}^{1} q_2(x)\,dx = 0$.

The quadrature point for the first Gauss–Legendre formula is $x_1 = 0$,
and the weight $a_1$ can be obtained from the exactness condition

$$a_1 = \int_{-1}^{1} dx = 2.$$

Hence the first Gauss–Legendre formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx 2f(0) \tag{9.20}$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - 2f(0) = \frac{1}{3}\,f''(\xi)$$

for some $\xi \in [-1,1]$. The coefficient of the derivative on the right-hand
side follows most easily by inserting $f(x) = x^2$. For obvious reasons, this
Gauss–Legendre formula is also known as the midpoint rule.

The quadrature points for the second Gauss–Legendre formula are
$x_1 = -1/\sqrt{3}$ and $x_2 = 1/\sqrt{3}$. The weights can be obtained from the
exactness conditions

$$a_1 + a_2 = \int_{-1}^{1} dx = 2,$$

$$a_1 x_1 + a_2 x_2 = \int_{-1}^{1} x\,dx = 0,$$

and they have the values $a_1 = a_2 = 1$. Hence the second Gauss–Legendre
formula is given by

$$\int_{-1}^{1} f(x)\,dx \approx f\left(-\frac{1}{\sqrt{3}}\right) + f\left(\frac{1}{\sqrt{3}}\right)$$

with the error representation

$$\int_{-1}^{1} f(x)\,dx - f\left(-\frac{1}{\sqrt{3}}\right) - f\left(\frac{1}{\sqrt{3}}\right) = \frac{1}{135}\,f^{(4)}(\xi)$$

for some $\xi \in [-1,1]$. The coefficient on the right-hand side follows by
inserting $f(x) = x^4$. □


From the Gaussian quadrature formula

$$\int_{-1}^{1} g(z)\,dz \approx \sum_{k=0}^{n} a_k g(\xi_k)$$

of order $n$ for the interval $[-1,1]$, by substituting

$$x = \frac{a+b}{2} + \frac{b-a}{2}\,z$$

and $f(x) = g(z)$ we obtain the Gaussian quadrature formula

$$\int_a^b f(x)\,dx \approx \frac{b-a}{2} \sum_{k=0}^{n} a_k f\left(\frac{a+b}{2} + \frac{b-a}{2}\,\xi_k\right)$$

for an arbitrary interval $[a,b]$. The error representation

$$\int_{-1}^{1} g(z)\,dz - \sum_{k=0}^{n} a_k g(\xi_k) = \frac{g^{(2n+2)}(\zeta)}{(2n+2)!} \int_{-1}^{1} [q_{n+1}(z)]^2\,dz$$

with $\zeta \in [-1,1]$ can be transformed accordingly. Subdividing the interval
$[a,b]$ into $m$ equidistant subintervals with step width $h = (b-a)/m$ and
then applying to each subinterval the Gaussian quadrature formula of order
$n$, we obtain the composite Gaussian quadrature

$$\int_a^b f(x)\,dx \approx \frac{h}{2} \sum_{j=0}^{m-1} \sum_{k=0}^{n} a_k f\left(x_j + \frac{h}{2} + \frac{h}{2}\,\xi_k\right),$$

where $x_j = a + jh$, with an error of order $O(h^{2n})$. These composite Gaussian rules are used
quite frequently in practice. We illustrate their convergence behavior by
Table 9.3, which gives the error between the exact value of the integral
from Example 9.6 and its numerical approximation by composite Gaussian
quadrature of orders one and two. As predicted by our error analysis, if the
number of quadrature points is doubled, i.e., if the step size $h$ is halved,
then the error for the Gaussian quadrature of orders one and two is reduced
roughly by the factor 1/4 and 1/16, respectively.


TABLE 9.3. Gaussian quadrature for Example 9.6

m     First Gauss–Legendre formula   Second Gauss–Legendre formula
1     0.02648051                     0.00083949
2     0.00743289                     0.00007054
4     0.00192729                     0.00000489
8     0.00048663                     0.00000031
16    0.00012197                     0.00000002
32    0.00003051                     0.00000000
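
The entries of Table 9.3 can be reproduced with a short composite Gauss–Legendre routine. This is a minimal sketch that hard-codes only the first and second Gauss–Legendre formulae from Example 9.22; the names `GAUSS_LEGENDRE` and `composite_gauss` are illustrative and do not come from the text.

```python
import math

# Quadrature points and weights on [-1, 1], keyed by the number of points:
# 1 -> midpoint rule (9.20), 2 -> second Gauss-Legendre formula.
GAUSS_LEGENDRE = {
    1: ((0.0,), (2.0,)),
    2: ((-1.0 / math.sqrt(3.0), 1.0 / math.sqrt(3.0)), (1.0, 1.0)),
}


def composite_gauss(f, a, b, m, points):
    """Composite Gauss-Legendre quadrature on m equidistant subintervals."""
    nodes, weights = GAUSS_LEGENDRE[points]
    h = (b - a) / m
    total = 0.0
    for j in range(m):
        mid = a + j * h + h / 2  # midpoint of the j-th subinterval
        total += sum(w * f(mid + h / 2 * xi) for xi, w in zip(nodes, weights))
    return h / 2 * total


def integrand(x):
    """Integrand of Example 9.6, whose integral over [0, 1] is ln 2."""
    return 1.0 / (1.0 + x)


if __name__ == "__main__":
    exact = math.log(2.0)
    for m in (1, 2, 4, 8, 16, 32):
        e1 = exact - composite_gauss(integrand, 0.0, 1.0, m, 1)
        e2 = exact - composite_gauss(integrand, 0.0, 1.0, m, 2)
        print(f"m={m:2d}  one point: {e1:.8f}  two points: {e2:.8f}")
```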
We proceed by deriving the Euler–Maclaurin expansion.

Definition 9.23 The Bernoulli polynomials $B_n$ of degree $n$ are defined
recursively by $B_0(x) := 1$ and

$$B_n' := B_{n-1}, \quad n \in \mathbb{N}, \tag{9.21}$$

with the normalization condition

$$\int_0^1 B_n(x)\,dx = 0, \quad n \in \mathbb{N}. \tag{9.22}$$

The rational numbers

$$b_n := n!\,B_n(0), \quad n = 0,1,\ldots,$$

are called Bernoulli numbers.
The first Bernoulli polynomials are given by

$$B_0(x) = 1, \quad B_1(x) = x - \frac{1}{2}, \quad B_2(x) = \frac{1}{2}\,x^2 - \frac{1}{2}\,x + \frac{1}{12}.$$

We note that the normalization (9.22) is equivalent to

$$B_n(0) = B_n(1), \quad n = 2,3,\ldots. \tag{9.23}$$


Lemma 9.24 The Bernoulli polynomials have the symmetry property

$$B_n(x) = (-1)^n B_n(1 - x), \quad x \in \mathbb{R}, \quad n = 0,1,\ldots. \tag{9.24}$$

Proof. Obviously (9.24) holds for $n = 0$. Assume that (9.24) has been
proven for some $n \ge 0$. Then, integrating (9.24), we obtain

$$B_{n+1}(x) = (-1)^{n+1} B_{n+1}(1 - x) + \beta_{n+1}$$

for some constant $\beta_{n+1}$. The condition (9.22) implies that $\beta_{n+1} = 0$, and
therefore (9.24) is also valid for $n + 1$. □


Lemma 9.25 The Bernoulli polynomials $B_{2m+1}$, $m = 1,2,\ldots$, of odd
degree have exactly three zeros in $[0,1]$, and these zeros are at the points
0, 1/2, and 1. The Bernoulli polynomials $B_{2m}$, $m = 0,1,\ldots$, of even degree
satisfy $B_{2m}(0) \ne 0$.

Proof. From (9.23) and (9.24) we conclude that $B_{2m+1}$ vanishes at the
points 0, 1/2, and 1. We prove by induction that these are the only zeros
of $B_{2m+1}$ in $[0,1]$. This is true for $m = 1$, since $B_3$ is a polynomial of degree
three. Assume that we have proven that $B_{2m+1}$ has only the three zeros
0, 1/2, and 1 in $[0,1]$, and assume that $B_{2m+3}$ has an additional zero $\alpha$ in
$[0,1]$. Because of the symmetry (9.24) we may assume that $\alpha \in (0,1/2)$.
Then, by Rolle’s theorem, we conclude that $B_{2m+2}$ has at least one zero in
$(0,\alpha)$ and also at least one zero in $(\alpha,1/2)$. Again by Rolle’s theorem this
implies that $B_{2m+1}$ has a zero in $(0,1/2)$, which contradicts the induction
assumption.

From the zeros of $B_{2m+1}$, by Rolle’s theorem it follows that $B_{2m}$ has a
zero in $(0,1/2)$. Assume that $B_{2m}(0) = 0$. Then, by Rolle’s theorem, $B_{2m-1}$
has a zero in $(0,1/2)$, which contradicts the first part of the lemma. □


By $\tilde{B}_n : \mathbb{R} \to \mathbb{R}$ we denote the periodic extension of the Bernoulli
polynomial $B_n$; i.e., $\tilde{B}_n$ has period 1 and $\tilde{B}_n(x) = B_n(x)$ for $0 \le x < 1$.
The Fourier series of the periodic functions $\tilde{B}_n$ are given by

$$\tilde{B}_{2m}(x) = 2(-1)^{m-1} \sum_{k=1}^{\infty} \frac{\cos 2\pi kx}{(2\pi k)^{2m}} \tag{9.25}$$

and

$$\tilde{B}_{2m-1}(x) = 2(-1)^{m} \sum_{k=1}^{\infty} \frac{\sin 2\pi kx}{(2\pi k)^{2m-1}} \tag{9.26}$$

for $m = 1,2,\ldots$. This follows from (9.21) and (9.22) and the elementary
Fourier expansion for the piecewise linear function $\tilde{B}_1$ (see Problem 9.13).

Let $x_k = a + kh$, $k = 0,\ldots,n$, be an equidistant subdivision of the
interval $[a,b]$ with step size $h = (b-a)/n$ and recall the definition of the
trapezoidal sum

$$T_h(f) := h\left[\frac{1}{2} f(x_0) + f(x_1) + \cdots + f(x_{n-1}) + \frac{1}{2} f(x_n)\right]$$

for $f \in C[a,b]$.


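Theorem 9.26 Let $f : [a,b] \to \mathbb{R}$ be $m$ times continuously differentiable
for $m \ge 2$. Then we have the Euler–Maclaurin expansion

$$\int_a^b f(x)\,dx = T_h(f) - \sum_{j=1}^{[m/2]} \frac{b_{2j}h^{2j}}{(2j)!}\,[f^{(2j-1)}(b) - f^{(2j-1)}(a)] + (-1)^m h^m \int_a^b \tilde{B}_m\left(\frac{x-a}{h}\right) f^{(m)}(x)\,dx, \tag{9.27}$$

where $[m/2]$ denotes the largest integer smaller than or equal to $m/2$.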


Proof. Let $g \in C^m[0,1]$. Then, by $m - 1$ partial integrations and using
(9.23) we find that

$$\int_0^1 B_1(z)g'(z)\,dz = \sum_{j=2}^{m} (-1)^j B_j(0)\,[g^{(j-1)}(1) - g^{(j-1)}(0)] - (-1)^m \int_0^1 B_m(z)g^{(m)}(z)\,dz.$$

Combining this with the partial integration

$$\int_0^1 B_1(z)g'(z)\,dz = \frac{1}{2}\,[g(1) + g(0)] - \int_0^1 g(z)\,dz$$

and observing that the odd Bernoulli numbers vanish leads to

$$\int_0^1 g(z)\,dz = \frac{1}{2}\,[g(0) + g(1)] - \sum_{j=1}^{[m/2]} \frac{b_{2j}}{(2j)!}\,[g^{(2j-1)}(1) - g^{(2j-1)}(0)] + (-1)^m \int_0^1 B_m(z)g^{(m)}(z)\,dz.$$

Now we substitute $x = x_k + hz$ and $g(z) = f(x_k + hz)$ to obtain

$$\int_{x_k}^{x_{k+1}} f(x)\,dx = \frac{h}{2}\,[f(x_k) + f(x_{k+1})] - \sum_{j=1}^{[m/2]} \frac{b_{2j}h^{2j}}{(2j)!}\,[f^{(2j-1)}(x_{k+1}) - f^{(2j-1)}(x_k)] + (-1)^m h^m \int_{x_k}^{x_{k+1}} B_m\left(\frac{x - x_k}{h}\right) f^{(m)}(x)\,dx.$$

Finally, we sum the last equation for $k = 0,\ldots,n-1$ to arrive at the
Euler–Maclaurin expansion (9.27). □
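
The leading term of the Euler–Maclaurin expansion can be checked numerically. The sketch below assumes the integrand of Example 9.6 and uses $b_2 = 2!\,B_2(0) = 1/6$, so that the first correction term is $h^2/12$ times the difference of the first derivatives at the endpoints; the function names are hypothetical. After subtracting this term, the remaining error of the trapezoidal sum should behave like $O(h^4)$.

```python
import math


def trapezoid_sum(f, a, b, n):
    """Trapezoidal sum T_h(f) with step size h = (b - a) / n."""
    h = (b - a) / n
    return h * (0.5 * f(a) + sum(f(a + k * h) for k in range(1, n)) + 0.5 * f(b))


def corrected_trapezoid(f, df, a, b, n):
    """Trapezoidal sum minus the first Euler-Maclaurin correction term.

    The term is b_2 * h^2 / 2! * [f'(b) - f'(a)] with b_2 = 1/6.
    df is the first derivative of f (supplied by the caller).
    """
    h = (b - a) / n
    return trapezoid_sum(f, a, b, n) - h ** 2 / 12.0 * (df(b) - df(a))


def f(x):
    return 1.0 / (1.0 + x)


def df(x):
    return -1.0 / (1.0 + x) ** 2


if __name__ == "__main__":
    exact = math.log(2.0)
    previous = None
    for n in (1, 2, 4, 8, 16):
        err = exact - corrected_trapezoid(f, df, 0.0, 1.0, n)
        ratio = "" if previous is None else f"  ratio={previous / err:.2f}"
        print(f"n={n:2d}  corrected error={err:.3e}{ratio}")
        previous = err
```

The printed ratios of successive errors should approach 16 as n grows, which is the behavior that the Romberg scheme exploits without requiring derivatives.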


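For $2\pi$-periodic continuous functions $f : \mathbb{R} \to \mathbb{R}$ the trapezoidal rule
coincides with the rectangular rule

$$\int_0^{2\pi} f(x)\,dx \approx \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right).$$

For its error

$$E_n(f) := \int_0^{2\pi} f(x)\,dx - \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right)$$

we have the following corollary of the Euler–Maclaurin expansion.

Corollary 9.27 Let $f : \mathbb{R} \to \mathbb{R}$ be $(2m+1)$-times continuously differentiable
and $2\pi$-periodic for $m \in \mathbb{N}$ and let $n \in \mathbb{N}$. Then for the error of the
rectangular rule we have

$$|E_n(f)| \le \frac{C}{n^{2m+1}} \int_0^{2\pi} |f^{(2m+1)}(x)|\,dx,$$

where

$$C := 2 \sum_{k=1}^{\infty} \frac{1}{k^{2m+1}}.$$

Proof. From Theorem 9.26 we have that

$$E_n(f) = -\left(\frac{2\pi}{n}\right)^{2m+1} \int_0^{2\pi} \tilde{B}_{2m+1}\left(\frac{nx}{2\pi}\right) f^{(2m+1)}(x)\,dx,$$

and the estimate follows from the inequality

$$|\tilde{B}_{2m+1}(x)| \le 2 \sum_{k=1}^{\infty} \frac{1}{(2\pi k)^{2m+1}}, \quad x \in \mathbb{R},$$

which is a consequence of (9.26). □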
Corollary 9.27 illustrates why for periodic functions the simple
rectangular rule is superior to any other quadrature rule (see Problem 9.12). Note
that the rectangular rule can also be obtained by integrating the
trigonometric interpolation polynomials of Theorems 8.24 and 8.25.

In the following theorem we give an example of derivative-free error esti- 
mates for numerical quadrature rules in the spirit of Davis [15]. They have 
the advantage that they do not need the computation of higher derivatives 
for the evaluation of the estimates. However, they require the integrand to 
be analytic, and their proofs need complex analysis. 


Theorem 9.28 Let $f : \mathbb{R} \to \mathbb{R}$ be analytic and $2\pi$-periodic. Then there
exists a strip $D = \mathbb{R} \times (-a,a) \subset \mathbb{C}$ with $a > 0$ such that $f$ can be extended
to a holomorphic and $2\pi$-periodic bounded function $f : D \to \mathbb{C}$. The error
for the rectangular rule can be estimated by

$$|E_n(f)| \le \frac{4\pi M}{e^{na} - 1},$$

where $M$ denotes a bound for the holomorphic function $f$ on $D$.

Proof. Since $f : \mathbb{R} \to \mathbb{R}$ is analytic, at each point $x \in \mathbb{R}$ the Taylor
expansion provides a holomorphic extension of $f$ into some open disk in the
complex plane with radius $r(x) > 0$ and center $x$. The extended function
again has period $2\pi$, since the coefficients of the Taylor series at $x$ and
at $x + 2\pi$ coincide for the $2\pi$-periodic function $f : \mathbb{R} \to \mathbb{R}$. The disks
corresponding to all points of the interval $[0,2\pi]$ provide an open covering
of $[0,2\pi]$. Since $[0,2\pi]$ is compact, a finite number of these disks suffices to
cover $[0,2\pi]$. Then we have an extension into a strip $D$ with finite width
$2a$ contained in the union of the finite number of disks. Without loss of
generality we may assume that $f$ is bounded on $D$.

From the residue theorem we have that

$$\int_{i\alpha}^{i\alpha+2\pi} \cot\frac{nz}{2}\,f(z)\,dz - \int_{-i\alpha}^{-i\alpha+2\pi} \cot\frac{nz}{2}\,f(z)\,dz = -\frac{4\pi i}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right)$$

for each $0 < \alpha < a$. This implies that

$$\operatorname{Re} \int_{i\alpha}^{i\alpha+2\pi} i\cot\frac{nz}{2}\,f(z)\,dz = \frac{2\pi}{n} \sum_{k=1}^{n} f\left(\frac{2\pi k}{n}\right),$$

since by the Schwarz reflection principle, $f$ enjoys the symmetry property
$f(\bar{z}) = \overline{f(z)}$. By Cauchy’s integral theorem we have

$$\operatorname{Re} \int_{i\alpha}^{i\alpha+2\pi} f(z)\,dz = \int_0^{2\pi} f(x)\,dx,$$

and combining the last two equations yields

$$E_n(f) = \operatorname{Re} \int_{i\alpha}^{i\alpha+2\pi} \left(1 - i\cot\frac{nz}{2}\right) f(z)\,dz$$

for all $0 < \alpha < a$. Now the estimate follows from

$$\left|1 - i\cot\frac{nz}{2}\right| \le \frac{2}{e^{n\alpha} - 1}$$

for $\operatorname{Im} z = \alpha$ and then passing to the limit $\alpha \to a$. □


The estimate shows that for periodic analytic functions the rectangu- 
lar rule is of exponential order; i.e., doubling the number of quadrature 
points doubles the number of correct digits in the approximate value for 
the integral. 
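
The exponential convergence of the rectangular rule is easy to observe numerically. The following sketch uses the test function $f(x) = 1/(2 + \cos x)$, which is an assumption made for this illustration and not an example from the text; it is analytic and $2\pi$-periodic, and its integral over $[0, 2\pi]$ is $2\pi/\sqrt{3}$. The function name `rectangular_rule` is likewise hypothetical.

```python
import math


def rectangular_rule(f, n):
    """Rectangular rule (2*pi/n) * sum_{k=1}^{n} f(2*pi*k/n) on [0, 2*pi]."""
    h = 2.0 * math.pi / n
    return h * sum(f(k * h) for k in range(1, n + 1))


def periodic_integrand(x):
    """Analytic 2*pi-periodic test function (illustrative choice)."""
    return 1.0 / (2.0 + math.cos(x))


if __name__ == "__main__":
    exact = 2.0 * math.pi / math.sqrt(3.0)
    for n in (2, 4, 8, 16, 32):
        err = abs(exact - rectangular_rule(periodic_integrand, n))
        print(f"n={n:2d}  |error|={err:.3e}")
```

In this run, each doubling of n should roughly double the number of correct digits until rounding error dominates.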


9.5 Romberg Integration 


We now proceed with describing the extrapolation method due to Richard- 
son (1927). Its basic idea is to derive high-order approximation methods 
from simple low-order methods. It can be applied to a variety of formulae in 
numerical analysis, and its application to the Euler-Maclaurin expansion 
was suggested by Romberg in 1955. 

Recall the composite trapezoidal rule 


$$T_h(f) = h\left[\frac{1}{2} f(a) + \sum_{k=1}^{n-1} f(a + kh) + \frac{1}{2} f(b)\right]$$

with step size $h = (b-a)/n$. If $f$ is four-times continuously differentiable,
by the Euler–Maclaurin expansion from Theorem 9.26 we have an error
representation of the form

$$\int_a^b f(x)\,dx = T_h(f) + \gamma_1 h^2 + O(h^4)$$

for some constant $\gamma_1$ depending on $f$ but not on $h$. Hence, for half the step
size, we have that

$$\int_a^b f(x)\,dx = T_{h/2}(f) + \gamma_1 \frac{h^2}{4} + O(h^4).$$

From these two equations we can eliminate the terms containing $h^2$; i.e.,
we multiply the first equation by $-1/3$ and the second equation by $4/3$ and
add both equations to obtain

$$\int_a^b f(x)\,dx = \frac{1}{3}\,[4T_{h/2}(f) - T_h(f)] + O(h^4).$$

Hence, the linear combination

$$T_h^2(f) := \frac{1}{3}\,[4T_{h/2}(f) - T_h(f)]$$

of the composite trapezoidal rule with step sizes $h$ and $h/2$ leads to a
quadrature formula with the improved error order $O(h^4)$. The quadrature
$T_h^2(f)$ coincides with the composite Simpson’s rule for the step size $h/2$.

If $f$ is six-times continuously differentiable, by linearly combining the
Euler–Maclaurin formulae for the step sizes $h$ and $h/2$ we obtain an error
representation of the form

$$\int_a^b f(x)\,dx = T_h^2(f) + \gamma_2 h^4 + O(h^6)$$

for some constant $\gamma_2$ depending only on $f$. From this and the corresponding
formula

$$\int_a^b f(x)\,dx = T_{h/2}^2(f) + \gamma_2 \frac{h^4}{16} + O(h^6)$$

for step size $h/2$, by eliminating the terms containing $h^4$ we obtain the
quadrature formula

$$T_h^3(f) := \frac{1}{15}\,[16T_{h/2}^2(f) - T_h^2(f)]$$

with an error of order $O(h^6)$. Note that the actual numerical evaluation of
$T_h^3(f)$ requires the values for the composite trapezoidal rule for the step
sizes $h$, $h/2$, and $h/4$.

Obviously, this procedure can be repeated, and this leads to the sequence
of Romberg quadrature formulae. Let

$$T_k^1(f) := T_{h_k}(f), \quad k = 0,1,2,\ldots,$$

be the trapezoidal sums for the step sizes $h_k := h/2^k$. Then for $m = 1,2,\ldots$
the Romberg quadratures are recursively defined by

$$T_k^{m+1}(f) := \frac{1}{4^m - 1}\,[4^m T_{k+1}^m(f) - T_k^m(f)], \quad k = 0,1,\ldots. \tag{9.28}$$

For the error we have the following theorem.


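Theorem 9.29 Let $f : [a,b] \to \mathbb{R}$ be $2m$-times continuously differentiable.
Then for the Romberg quadratures we have the error estimate

$$\left|\int_a^b f(x)\,dx - T_k^m(f)\right| \le C_m \|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m}, \quad k = 0,1,\ldots,$$

for some constant $C_m$ depending on $m$.

Proof. By induction, we show that there exist constants $\gamma_{j,i}$ such that

$$\left|\int_a^b f(x)\,dx - T_k^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i}\,[f^{(2j-1)}(b) - f^{(2j-1)}(a)] \left(\frac{h}{2^k}\right)^{2j}\right| \le \gamma_{m,i}\,\|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m} \tag{9.29}$$

for $i = 1,\ldots,m$ and $k = 0,1,\ldots$. Here the sum on the left-hand side is set
equal to zero for $i = m$. By the Euler–Maclaurin expansion this is true for
$i = 1$ with $\gamma_{j,1} = -b_{2j}/(2j)!$ for $j = 1,\ldots,m-1$ and

$$\gamma_{m,1} = (b-a) \sup_{x \in [0,1]} |B_{2m}(x)|.$$

As an abbreviation we set

$$F_j := f^{(2j-1)}(b) - f^{(2j-1)}(a), \quad j = 1,\ldots,m-1.$$

Assume that (9.29) has been shown for some $1 \le i < m$. Then, using (9.28),
we obtain

$$\frac{4^i}{4^i - 1}\left[\int_a^b f(x)\,dx - T_{k+1}^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i} F_j \left(\frac{h}{2^{k+1}}\right)^{2j}\right] - \frac{1}{4^i - 1}\left[\int_a^b f(x)\,dx - T_k^i(f) - \sum_{j=i}^{m-1} \gamma_{j,i} F_j \left(\frac{h}{2^k}\right)^{2j}\right]$$

$$= \int_a^b f(x)\,dx - T_k^{i+1}(f) - \sum_{j=i+1}^{m-1} \gamma_{j,i+1} F_j \left(\frac{h}{2^k}\right)^{2j},$$

where

$$\gamma_{j,i+1} = \frac{4^{i-j} - 1}{4^i - 1}\,\gamma_{j,i}, \quad j = i+1,\ldots,m-1.$$

Now with the aid of the induction assumption we can estimate

$$\left|\int_a^b f(x)\,dx - T_k^{i+1}(f) - \sum_{j=i+1}^{m-1} \gamma_{j,i+1} F_j \left(\frac{h}{2^k}\right)^{2j}\right| \le \gamma_{m,i+1}\,\|f^{(2m)}\|_\infty \left(\frac{h}{2^k}\right)^{2m},$$

where

$$\gamma_{m,i+1} = \frac{4^{i-m} + 1}{4^i - 1}\,\gamma_{m,i},$$

and the proof is complete. □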


From Theorem 9.29 we conclude that the Romberg quadrature $T_k^m$
integrates polynomials of degree less than or equal to $2m - 1$ exactly. For
$h = b - a$ the Romberg quadrature $T_0^m$ uses $2^{m-1} + 1$ equidistant integration
points. Therefore, $T_0^1$ coincides with the trapezoidal rule, $T_0^2$ with Simpson’s
rule, and $T_0^3$ with Milne’s rule. Similarly, $T_k^1$, $T_k^2$, and $T_k^3$ correspond
to the composite trapezoidal rule, the composite Simpson’s rule, and the
composite Milne rule, respectively. For $m \ge 4$ the number of the quadrature
points in $T_0^m$ is greater than the degree of exactness. The Romberg formula
$T_0^4$ uses nine quadrature points, and this is the number of quadrature points
where the Newton–Cotes formulae start having negative weights.


Theorem 9.30 The quadrature weights of the Romberg formulae are pos- 
ative. 


Proof. We define recursively Qj := 4T;,, — 2T; and 


mm 1 m 
ke +l nn [aor + 27h + am ttor | (9.30) 
fork = 1,2,... and m=1,2... and show by induction that 
1 
Te = a i + OF). (9.31) 


By the definition of Q; this is true for m = 1. We assume that (9.31) has 
been proven for some m > 1. Then, using the recursive definitions of T;” 
and @;” and the induction assumption, we derive 


soy Tiha + Gea] — qe OT — Tr 
4 1 4 1 


Tet +QRTt = 
= qmtiqimtl _qimtl — qmt+l _ yypit?. 


i.e., (9.31) also holds for $m+1$. Now, from (9.30) and (9.31), by induction
with respect to $m$, it can be deduced that the weights of $T_k^m$ are positive
and that the weights of $Q_k^m$ are nonnegative. □


Corollary 9.31 For the Romberg quadratures we have convergence:

$$\lim_{m\to\infty} T_k^m(f) = \int_a^b f(x)\,dx \quad\text{and}\quad \lim_{k\to\infty} T_k^m(f) = \int_a^b f(x)\,dx$$
for all continuous functions $f$.


Proof. This follows from Theorems 9.29 and 9.30 and Corollary 9.11. □


For continuous functions, the trapezoidal sums converge as the step size
tends to zero. This motivates us to consider a polynomial in $h^2$ interpolating
the values $T_k^1(f),\ldots,T_{k+m}^1(f)$ at the interpolation points $h_k^2,\ldots,h_{k+m}^2$ and
evaluate it at $h = 0$.


Theorem 9.32 Denote by $L_k^m$ the uniquely determined polynomial in $h^2$
of degree less than or equal to $m$ with the interpolation property

$$L_k^m(h_j^2) = T_j^1(f), \quad j = k,\ldots,k+m.$$
Then the Romberg quadratures satisfy
$$T_k^{m+1}(f) = L_k^m(0). \tag{9.32}$$


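Proof. Obviously, (9.32) is true for $m = 0$. Assume that it has been proven
for $m - 1$. Then, using the Neville scheme from Theorem 8.9, we obtain

$$\begin{aligned}
L_k^m(0) &= \frac{1}{h_{k+m}^2 - h_k^2}\left[-h_k^2\, L_{k+1}^{m-1}(0) + h_{k+m}^2\, L_k^{m-1}(0)\right] \\
&= \frac{1}{h_{k+m}^2 - h_k^2}\left[-h_k^2\, T_{k+1}^m + h_{k+m}^2\, T_k^m\right] \\
&= \frac{1}{4^m - 1}\left[4^m\, T_{k+1}^m - T_k^m\right] = T_k^{m+1},
\end{aligned}$$
where the last line uses $h_{k+m} = 2^{-m} h_k$, establishing (9.32) for $m$. □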


This interpretation of the Romberg quadrature as an extrapolation method 
in the sense of Richardson opens up the possibility of modifications using 
other than equidistant step sizes. 
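
The recursion behind the Romberg scheme is short enough to try out directly. The following is a minimal sketch, not the book's program: the function names `trapezoid` and `romberg_table` are illustrative, and the integrand $1/(1+x)$ on $[0,1]$ is an assumption chosen because it reproduces the leading entries of Table 9.4. The table entry `T[k][m-1]` holds $T_k^m$ for the initial step size $h = b - a$.

```python
import math


def trapezoid(f, a, b, n):
    """Composite trapezoidal rule with n equal subintervals."""
    h = (b - a) / n
    inner = sum(f(a + i * h) for i in range(1, n))
    return h * (0.5 * f(a) + inner + 0.5 * f(b))


def romberg_table(f, a, b, rows):
    """Return T with T[k][m-1] = T_k^m for all k + m <= rows."""
    # First column: trapezoidal sums with step sizes (b - a) / 2^k.
    T = [[trapezoid(f, a, b, 2 ** k)] for k in range(rows)]
    for m in range(1, rows):
        for k in range(rows - m):
            # T_k^{m+1} = (4^m T_{k+1}^m - T_k^m) / (4^m - 1)
            T[k].append((4 ** m * T[k + 1][m - 1] - T[k][m - 1]) / (4 ** m - 1))
    return T


if __name__ == "__main__":
    # Assumed test integrand: 1/(1+x) on [0, 1], exact value ln 2.
    T = romberg_table(lambda x: 1.0 / (1.0 + x), 0.0, 1.0, 6)
    for m in range(1, 5):
        print(m, math.log(2.0) - T[0][m - 1])
```

Under this assumption the errors of $T_0^1$, $T_1^1$, and $T_0^2$ agree with the first entries of Table 9.4, which suggests that row $r$ of the table lists $T_r^1, T_{r-1}^2, \ldots$.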

Table 9.4 gives the error between the exact value of the integral from 
Example 9.6 and its numerical approximation by the Romberg quadrature, 
exhibiting its fast convergence according to the error estimates of Theorem 
9.29. Clearly, the first two columns of Table 9.4 have to coincide with Table 
9.2. 


TABLE 9.4. Romberg quadratures for Example 9.6

−0.05685282
−0.01518615 | −0.00129726
−0.00387663 | −0.00010679 | −0.00002742
−0.00097467 | −0.00000735 | −0.00000072 | −0.00000030
−0.00024402 | −0.00000047 | −0.00000001 | −0.00000000
−0.00006103 | −0.00000003 | −0.00000000 | −0.00000000


We finish this section with the corresponding Table 9.5 for the integral

$$\int_0^1 \sqrt{x}\,dx = \frac{2}{3} \tag{9.33}$$

of a function that is not differentiable in all of the integration interval. Not
surprisingly, the convergence is notably slower.

TABLE 9.5. Romberg quadratures for the integral (9.33)

0.166667
0.063113 | 0.028595
0.023384 | 0.010140 | 0.008910
0.008536 | 0.003587 | 0.003151 | 0.003059
0.003085 | 0.001268 | 0.001114 | 0.001082 | 0.001074
0.001108 | 0.000448 | 0.000394 | 0.000382 | 0.000380


9.6 Improper Integrals 


We conclude this chapter with an example for the numerical integration of 
improper integrals and describe a class of quadrature rules for the integral 


$$\int_0^1 f(x)\,dx$$


where the integrand f is sufficiently smooth in (0,1) but is allowed to have 
singularities at the endpoints x = 0 and x = 1 such that f is nonetheless 
integrable. 

Let the function $w : [0,2\pi] \to [0,1]$ be bijective, strictly monotonically
increasing, and infinitely differentiable. Then we can substitute $x = w(t)$
and consequently obtain

$$\int_0^1 f(x)\,dx = \int_0^{2\pi} g(t)\,dt,$$

where
$$g(t) := w'(t)\,f(w(t)), \quad 0 < t < 2\pi.$$
Now assume that the function $w$ has derivatives

$$w^{(j)}(0) = w^{(j)}(2\pi) = 0, \quad j = 1,\ldots,p-1, \tag{9.34}$$

and
$$w^{(p)}(0) \ne 0, \quad w^{(p)}(2\pi) \ne 0 \tag{9.35}$$


for some $p \in \mathbb{N}$. Then we may expect that the function $g$ and some of
its derivatives up to a certain order vanish at $t = 0$ and $t = 2\pi$; i.e., $g$
can be considered as a sufficiently smooth $2\pi$-periodic function, and the
rectangular rule may be applied to the transformed integral. This yields
the quadrature formula

$$\int_0^1 f(x)\,dx \approx \sum_{k=1}^{n-1} a_k f(x_k) \tag{9.36}$$
with the quadrature points and weights given by

$$x_k = w\left(\frac{2\pi k}{n}\right), \quad a_k = \frac{2\pi}{n}\, w'\left(\frac{2\pi k}{n}\right), \quad k = 1,\ldots,n-1.$$

In addition, it is natural to require the symmetry property
$$w(t) = 1 - w(2\pi - t), \quad t \in [0,2\pi]. \tag{9.37}$$
Then the quadrature points and weights have the symmetry
$$x_{n-k} = 1 - x_k, \quad a_{n-k} = a_k, \quad k = 1,\ldots,n-1,$$


and from the assumptions (9.34) and (9.35), by Taylor's formula, it follows
that they satisfy the inequalities

$$c_0\left(\frac{k}{n}\right)^{p} \le x_k,\ 1 - x_{n-k} \le c_1\left(\frac{k}{n}\right)^{p}, \quad k = 1,\ldots,\left[\frac{n}{2}\right], \tag{9.38}$$

and

$$\frac{c_0}{n}\left(\frac{k}{n}\right)^{p-1} \le a_k,\ a_{n-k} \le \frac{c_1}{n}\left(\frac{k}{n}\right)^{p-1}, \quad k = 1,\ldots,\left[\frac{n}{2}\right], \tag{9.39}$$

for some constants $0 < c_0 < c_1$ depending on the function $w$. From (9.38) it
is obvious that the quadrature points are graded towards the two endpoints
$x = 0$ and $x = 1$ of the integration interval.

For substitutions with the properties (9.34), (9.35), and (9.37), from the 
Euler—Maclaurin expansion applied to the integral over g we now will derive 
an estimate for the remainder term 


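$$E_n(f) := \int_0^1 f(x)\,dx - \sum_{k=1}^{n-1} a_k f(x_k).$$

For $q \in \mathbb{N}$ and $0 < \alpha \le 1$ by $S^{q,\alpha}$ we denote the linear space of $q$-times
continuously differentiable functions $f : (0,1) \to \mathbb{R}$ for which

$$\sup_{0<x<1}\,[x(1-x)]^{j+1-\alpha}\,|f^{(j)}(x)| < \infty$$

for $j = 0,\ldots,q$. On $S^{q,\alpha}$ we define the norm

$$\|f\|_{q,\alpha} := \max_{j=0,\ldots,q}\ \sup_{0<x<1}\,[x(1-x)]^{j+1-\alpha}\,|f^{(j)}(x)|.$$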

Then, clearly,
$$|f^{(j)}(x)| \le \|f\|_{q,\alpha}\,[x(1-x)]^{\alpha-1-j}, \quad 0 < x < 1, \tag{9.40}$$

for $j = 0,\ldots,q$.


Theorem 9.33 Let $p \in \mathbb{N}$ and assume that $w$ satisfies (9.34), (9.35), and
(9.37). Further, let $q \in \mathbb{N}$ and $f \in S^{2q+1,\alpha}$ with $0 < \alpha \le 1$ such that

$$2q+1 < \alpha p \quad\text{and}\quad 2q+2 \le p.$$

Then the error in the quadrature formula (9.36) can be estimated by

$$|E_n(f)| \le \frac{C}{n^{2q+1}}\,\|f\|_{2q+1,\alpha}$$
with some constant $C$ depending on $w$, $\alpha$, and $q$.


Proof. For the derivatives of $g$ we can write

$$g^{(r)}(t) = \sum_{j=0}^{r} u_j^r(t)\, f^{(j)}(w(t)), \quad r = 0,\ldots,2q+1.$$

Then from
$$g^{(r+1)}(t) = \sum_{j=0}^{r} \left\{ u_j^r(t)\, w'(t)\, f^{(j+1)}(w(t)) + \frac{du_j^r(t)}{dt}\, f^{(j)}(w(t)) \right\}$$
we derive the recursion formulae
$$u_j^{r+1}(t) = \begin{cases} \dfrac{du_0^r(t)}{dt}, & j = 0, \\[2mm] u_{j-1}^r(t)\, w'(t) + \dfrac{du_j^r(t)}{dt}, & j = 1,\ldots,r, \\[2mm] u_r^r(t)\, w'(t), & j = r+1, \end{cases} \tag{9.41}$$
for the coefficients $u_j^r$. In particular, we have
$$u_0^r(t) = w^{(r+1)}(t) \quad\text{and}\quad u_r^r(t) = [w'(t)]^{r+1}. \tag{9.42}$$
The functions $u_j^r$ satisfy
$$u_j^r(t) = O\left([t(2\pi - t)]^{z_j^r}\right), \quad t(2\pi - t) \to 0, \tag{9.43}$$

for $r = 0,\ldots,2q+1$ and $j = 0,\ldots,r$, where
$$z_j^r = p - 1 + jp - r.$$

For $j = 0$ and $j = r$ this is obvious from the assumption on $w$ and (9.42),
and for $j = 1,\ldots,r-1$ it follows by induction from the recursion formulae
(9.41). Note that $z_j^r \ge 0$ because of the assumption $p \ge 2q + 2$.

Using (9.40) and the assumptions on $w$, we can estimate

$$|f^{(j)}(w(t))| \le C_1\,\|f\|_{2q+1,\alpha}\,[t(2\pi - t)]^{(\alpha - j - 1)p}$$
for some constant $C_1$, and with the aid of (9.43), we further obtain that
$$|u_j^r(t)\, f^{(j)}(w(t))| \le C_2\,\|f\|_{2q+1,\alpha}\,[t(2\pi - t)]^{\alpha p - r - 1}, \quad 0 < t < 2\pi, \tag{9.44}$$

for some constant $C_2$ and $r = 0,\ldots,2q+1$ and $j = 0,\ldots,r$. From this,
since $\alpha p > 2q+1$, we observe that for $r = 0,\ldots,2q$ the derivatives $g^{(r)}$ can
be continuously extended from $(0,2\pi)$ onto $[0,2\pi]$ with values

$$g^{(r)}(0) = g^{(r)}(2\pi) = 0, \quad r = 0,\ldots,2q.$$

Furthermore, from (9.44) and the assumption $\alpha p > 2q + 1$ we see that the
integral of $g^{(2q+1)}$ over $[0,2\pi]$ exists as an improper integral and

$$\int_0^{2\pi} |g^{(2q+1)}(t)|\,dt \le C_3\,\|f\|_{2q+1,\alpha}$$

with some constant $C_3$ depending on $w$, $\alpha$, and $q$. Now the statement
follows from Corollary 9.27 of the Euler–Maclaurin expansion. Note that for
the Euler–Maclaurin expansion (9.27) to be valid it obviously suffices that
the integral of the error term exists as an improper integral. □


We proceed by describing a few examples for substitutions $w$ (see
Problem 9.19). In 1963 Korobov suggested the polynomial transformation

$$w(t) := \int_0^t [s(2\pi - s)]^{p-1}\,ds \Big/ \int_0^{2\pi} [s(2\pi - s)]^{p-1}\,ds. \tag{9.45}$$

The trigonometric transformation

$$w_p(t) := \int_0^t \sin^{p-1}\frac{s}{2}\,ds \Big/ \int_0^{2\pi} \sin^{p-1}\frac{s}{2}\,ds \tag{9.46}$$

with the special cases

$$w_1(t) = \frac{t}{2\pi}, \quad w_2(t) = \frac{1}{2}\left(1 - \cos\frac{t}{2}\right), \quad w_3(t) = \frac{1}{2\pi}\,(t - \sin t)$$

was proposed by Sidi [54]. Substitutions of the form

$$w_p(t) = \frac{t^p}{t^p + (2\pi - t)^p} \tag{9.47}$$
were considered in [40]. As a rule of thumb, these substitutions should not
be used for $p$ too large, say $p > 10$, because this may lead to overgrading
and numerical difficulties with underflow. The substitutions


an 7 7 a 7 1 
w(t) = / exp (-=- sn) as i exp (-= - A ds 
and


vo eke) 


with zeros of infinite order at the endpoints, which were suggested by Iri, 
Moriguti, and Takesawa [58] and by Sag and Szekeres [52], respectively, 
also suffer from this drawback. 
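
The quadrature formula (9.36) with the substitution (9.47) can be sketched in a few lines. This is only an illustration under stated assumptions: the names `w`, `dw`, `graded_rule`, and `integrate` are hypothetical, the closed form of $w_p'$ is obtained here from the quotient rule, and the test integrand $x^{-1/2}$ (exact integral 2) is chosen for the sketch and is not the integral (9.48).

```python
import math


def w(t, p):
    """Substitution (9.47): w_p(t) = t^p / (t^p + (2*pi - t)^p)."""
    s = 2.0 * math.pi - t
    return t ** p / (t ** p + s ** p)


def dw(t, p):
    """Derivative of w_p by the quotient rule:
    w_p'(t) = 2*pi*p*[t(2*pi - t)]^(p-1) / (t^p + (2*pi - t)^p)^2."""
    s = 2.0 * math.pi - t
    return 2.0 * math.pi * p * (t * s) ** (p - 1) / (t ** p + s ** p) ** 2


def graded_rule(n, p):
    """Quadrature points x_k and weights a_k of (9.36), k = 1, ..., n-1."""
    points, weights = [], []
    for k in range(1, n):
        t = 2.0 * math.pi * k / n
        points.append(w(t, p))
        weights.append(2.0 * math.pi / n * dw(t, p))
    return points, weights


def integrate(f, n, p):
    """Approximate the integral of f over (0, 1) by the rule (9.36)."""
    points, weights = graded_rule(n, p)
    return sum(a * f(x) for x, a in zip(points, weights))


if __name__ == "__main__":
    # Assumed example: endpoint singularity at x = 0, exact value 2.
    for n in (16, 32, 64, 128):
        print(n, 2.0 - integrate(lambda x: x ** -0.5, n, 4))
```

Because the endpoints $t = 0$ and $t = 2\pi$ are never used, the integrand is evaluated only at interior points, so the singularity at $x = 0$ causes no division by zero.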

As a numerical example we consider the improper integral 


1 


Table 9.6 gives the error between the exact value and the numerical ap- 
proximation obtained by using the substitution (9.47). 


TABLE 9.6. Numerical quadrature for the integral (9.48)

0.07012542 | −0.06064201 | −0.22007377 | −0.42795942
0.02849925 |  0.00455233 | −0.00438402 | −0.01896018
0.00992273 |  0.00129852 |  0.00011279 | −0.00003394
0.00347755 |  0.00032530 |  0.00002117 | −0.00000019
0.00122386 |  0.00008137 |  0.00000382 | −0.00000001


Problems 


9.1 Show that the error for the composite trapezoidal rule can be expressed in
the form

$$\int_a^b f(x)\,dx - T_h(f) = -\int_a^b K_T(x)\, f''(x)\,dx,$$

where the so-called Peano kernel $K_T$ is given by
$$K_T(x) := \frac{1}{2}\,(x - x_{k-1})(x_k - x), \quad x_{k-1} \le x \le x_k,$$

for $k = 1,\ldots,n$. Use this error representation for an alternative proof of Theorem
9.7.


9.2 Show that the error for the composite Simpson's rule can be expressed in
the form

$$\int_a^b f(x)\,dx - S_h(f) = -\int_a^b K_S(x)\, f^{(4)}(x)\,dx,$$
where the Peano kernel $K_S$ is given by

$$K_S(x) := \begin{cases} \dfrac{h}{18}\,(x - x_{k-2})^3 - \dfrac{1}{24}\,(x - x_{k-2})^4, & x_{k-2} \le x \le x_{k-1}, \\[2mm] \dfrac{h}{18}\,(x_k - x)^3 - \dfrac{1}{24}\,(x_k - x)^4, & x_{k-1} \le x \le x_k, \end{cases}$$
for $k = 2,4,\ldots,n$. Use this error representation for an alternative proof of
Theorem 9.8.


9.3 For Newton's three-eighths rule prove the error representation

$$\int_a^b f(x)\,dx - \frac{3h}{8}\left[f(a) + 3f(a+h) + 3f(b-h) + f(b)\right] = -\frac{3h^5}{80}\, f^{(4)}(\xi)$$

with some $\xi$ in $[a,b]$ and $h = (b-a)/3$.


9.4 Show that the weight a, for the Newton—Cotes formula of order eight is 
negative. 


9.5 For the remainder $E_n$ of the Newton–Cotes formula of order $n$ on the
interval $[-1,1]$, applied to the Chebyshev polynomial $T_{n+1}$, show that


(n+1)!4"t! /” z 
En(Tn41) = ——— ; IN. 


From this conclude that if n odd, then 
Enlloo 2 |Hn(Tn+1)| 2 Yn, 
where 


(n —1)!4"+? 
Yn = a 
3nnrt2 


Hint: Use Theorem 8.10 and show that 


E(is)eef (nia) 


9.6 Compute the weights for the polynomial interpolatory quadratures with 
equidistant quadrature points 


+ oo, n->0o. 


for n odd. 


$$x_k = a + (k+1)\,\frac{b-a}{n+2}, \quad k = 0,1,\ldots,n,$$

for $n = 0,1,2$ and obtain representations of the quadrature errors. These formulae
are called open Newton–Cotes quadratures, since the two endpoints $a$ and $b$ are
omitted.


9.7 For $n \in \mathbb{N}$, a quadrature formula of the form


b n 
[ sears 2 fen 


with distinct quadrature points $x_1,\ldots,x_n \in [a,b]$ and equal weights is called a
Chebyshev quadrature if it integrates polynomials in $P_n$ exactly. Find the
Chebyshev quadratures for $n = 1,2,3,4$. (Chebyshev quadratures exist only for n < 8.)


9.8 Show that there exists no polynomial interpolatory quadrature of order n 
that integrates polynomials of degree 2n + 2 exactly. 


9.9 The Chebyshev polynomial of the second kind $U_n$ of degree $n$ is defined by

$$U_n(x) := \frac{\sin((n+1)\arccos x)}{\sin(\arccos x)}, \quad -1 \le x \le 1.$$

Show that $U_0(x) = 1$, $U_1(x) = 2x$, and
$$U_{n+1}(x) + U_{n-1}(x) = 2x\,U_n(x), \quad n = 1,2,\ldots.$$

Prove the orthogonality relation

$$\int_{-1}^{1} \sqrt{1 - x^2}\; U_n(x)\,U_m(x)\,dx = \frac{\pi}{2}\,\delta_{nm}.$$


9.10 Show that the quadrature points and quadrature weights for the
Gauss–Chebyshev quadrature of order $n - 1$ for the integral

$$\int_{-1}^{1} \sqrt{1 - x^2}\, f(x)\,dx$$

are given by

$$x_k = \cos\frac{(k+1)\pi}{n+1}$$
and
$$a_k = \frac{\pi}{n+1}\,\sin^2\frac{(k+1)\pi}{n+1}$$

for $k = 0,\ldots,n-1$.


9.11 Find the quadrature weights $a_0, a_1, a_2, a_3$, and the (remaining) quadrature
points $x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2) + a_3 f(1)$$

that is exact for all polynomials in $P_5$. (This is an example of a Gauss–Lobatto
quadrature, i.e., a Gauss quadrature with two preassigned quadrature points.)

Find the quadrature weights $a_0, a_1, a_2$, and the (remaining) quadrature points
$x_1, x_2$ of a quadrature formula of the form

$$\int_{-1}^{1} f(x)\,dx \approx a_0 f(-1) + a_1 f(x_1) + a_2 f(x_2)$$

that is exact for all polynomials in $P_4$. (This is an example of a Gauss–Radau
quadrature, i.e., a Gauss quadrature with one preassigned quadrature point.)


9.12 By approximating the integral

$$\int_0^{2\pi} \frac{dt}{5 - 4\cos t} = \frac{2\pi}{3}$$

by the rectangular rule and Simpson's rule convince yourself of the superiority of
the rectangular rule for periodic functions.


9.13 Verify the Fourier series (9.25) and (9.26) for the periodic Bernoulli poly- 
nomials. 


9.14 For the Bernoulli polynomials show that the series

$$\sum_{n=0}^{\infty} B_n(x)\,t^n = \frac{t\,e^{tx}}{e^t - 1}$$

is absolutely and locally uniformly convergent for all $x \in [0,1]$ and all $t \in (-1,1)$.


9.15 Derive a quadrature formula by integrating the interpolating cubic spline 
from Theorem 8.30 and discuss its relation to the Euler-Maclaurin expansion. 


9.16 Write a computer program for Romberg integration and test it for various 
examples. 


9.17 Calculate the weights of the Romberg quadratures T? and T;. 


9.18 Show that the Richardson extrapolation for the midpoint rule (9.20) leads 
to nonnegative quadrature weights. 


9.19 Show that the functions (9.45), (9.46), and (9.47) are strictly monoton- 
ically increasing, infinitely differentiable, and map [0,27] onto [0,1] such that 
(9.34), (9.35), and (9.37) are satisfied. 


9.20 Write a computer program for the numerical quadrature (9.36) using the 
substitution (9.47) and test it for various examples. 


10 


Initial Value Problems 


Historically, the study of differential equations originated in the beginnings 
of calculus with Newton and Leibniz in the seventeenth century and is 
closely interwoven with the general development of mathematics. To a sub- 
stantial degree, the central role of differential equations within mathematics 
is due to the fact that many important problems in science and engineering 
are modeled by differential equations. 

This chapter will be devoted to an introduction to the basic numerical 
approximation methods for initial value problems for ordinary differential 
equations. For a more comprehensive study we refer to [13, 33, 42, 46, 55]. 
Analogous to the need for numerical quadrature formulas, numerical
methods for the approximate solution of ordinary differential equations are
necessary, since in general no explicit solutions of the differential equation
will be known, despite the fact that there exists a broad range of
analytical solution methods for special classes of ordinary differential equations.
In addition, the functions and data involved in the differential equation 
problem quite often will be available only at discrete points. However, we 
would like to emphasize that despite the availability of numerical methods 
the study of elementary analytical methods for the solution of ordinary 
differential equations remains worthwhile, since it provides a first step into 
gaining insight into the general structure of differential equations. 

A solid foundation for numerical approximation methods for differential
equations, including their convergence and error analysis, requires as
a prerequisite results on the existence and uniqueness of the solution to
the problem to be approximately solved. Therefore, in Section 10.1 we will
begin with proving the fundamental Picard–Lindelöf existence and uniqueness
theorem for initial value problems. In Section 10.2 we will describe
some variants of the simplest method for the numerical solution of initial
value problems, which was first used by Euler. These methods are special
cases of so-called single-step methods, for which we will give a convergence
and error analysis in Section 10.3. This section also includes a short
discussion of the Runge–Kutta method as the most widely used single-step
method. The final section, Section 10.4, is concerned with the description
and analysis of multistep methods.

We wish to note explicitly that this chapter is also meant to serve as 
an application of some of the material provided in Chapters 8 and 9 on 
interpolation and numerical integration. 


10.1 The Picard–Lindelöf Theorem


Definition 10.1 Let $G \subset \mathbb{R}^2$ be a domain and $f : G \to \mathbb{R}$. A continuously
differentiable function $u : [a,b] \to \mathbb{R}$ is called a solution of the ordinary
differential equation of the first order

$$u' = f(x,u) \tag{10.1}$$
if $(x,u(x)) \in G$ and $u'(x) = f(x,u(x))$ for all $x \in [a,b]$.


Geometrically speaking, the differential equation (10.1) defines a field of 
directions on G. Solving the differential equation means looking for func- 
tions whose graphs match this field of directions. 

Systems of ordinary differential equations can be included in the
discussion as follows. If $G \subset \mathbb{R}^{n+1}$ is a domain and $f : G \to \mathbb{R}^n$, then a
continuously differentiable function $u : [a,b] \to \mathbb{R}^n$ is called a solution of
the system of ordinary differential equations of the first order

$$u' = f(x,u)$$

if $(x,u(x)) \in G$ and $u'(x) = f(x,u(x))$ for all $x \in [a,b]$. More explicitly,
this system reads

$$\begin{aligned} u_1' &= f_1(x,u_1,\ldots,u_n) \\ u_2' &= f_2(x,u_1,\ldots,u_n) \\ &\ \ \vdots \\ u_n' &= f_n(x,u_1,\ldots,u_n) \end{aligned}$$

for $u = (u_1,\ldots,u_n)^T$ and $f = (f_1,\ldots,f_n)^T$. Each ordinary differential
equation
$$u^{(n)} = f(x,u,u',\ldots,u^{(n-1)})$$
of order $n$ is equivalent to the system
$$u_1' = u_2, \quad u_2' = u_3, \quad \ldots, \quad u_{n-1}' = u_n, \quad u_n' = f(x,u_1,\ldots,u_n)$$

via $u_1 = u$, $u_2 = u'$, $\ldots$, $u_n = u^{(n-1)}$. Therefore, in principle, considering
only differential equations of the first order is no loss of generality.
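
The reduction of an equation of order $n$ to a first-order system is mechanical, as the following minimal sketch shows. The helper name `as_first_order_system` and the test equation $u'' = -u$ are illustrative assumptions, not part of the text.

```python
def as_first_order_system(f, n):
    """Turn u^(n) = f(x, u, u', ..., u^(n-1)) into U' = F(x, U),
    where U = (u_1, ..., u_n) = (u, u', ..., u^(n-1))."""
    def F(x, U):
        if len(U) != n:
            raise ValueError("state vector must have length n")
        # u_1' = u_2, ..., u_{n-1}' = u_n, and u_n' = f(x, u_1, ..., u_n)
        return list(U[1:]) + [f(x, *U)]
    return F


if __name__ == "__main__":
    # Assumed example: u'' = -u written as a system for (u, u').
    F = as_first_order_system(lambda x, u, du: -u, 2)
    print(F(0.0, [1.0, 0.0]))
```

Any method for first-order systems can then be applied to `F` without knowing the order of the original equation.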

From the wide field of applications we sketch only the following two 
simple examples. 


Example 10.2 By Newton's law, the differential equation of the second
order

$$m u'' = f(t,u)$$

describes the motion of an object of mass $m$ subject to the external force
$f(t,u)$ depending on the location $u$ of the object and the time $t$. Given an
initial location $u_0$ and an initial velocity $u_0'$ at the initial time $t = 0$, one
wants to find the position $u(t)$ of the object for all times $t \ge 0$. □


Example 10.3 Let $p = p(t)$ describe the population of a species of animals
or plants at time $t$. If $r(t,p)$ denotes the growth rate given by the difference
between the birth and death rate depending on the time $t$ and the size
$p$ of the population, then an isolated population satisfies the differential
equation

$$\frac{dp}{dt} = r(t,p).$$
The simplest model $r(t,p) = ap$, where $a$ is a positive constant, leads to

$$\frac{dp}{dt} = ap$$

with the explicit solution $p(t) = p_0 e^{a(t - t_0)}$. Such an exponential growth is
realistic only if the population is not too large. The modified model

$$\frac{dp}{dt} = ap - bp^2$$

with positive constants $a$ and $b$ contains a correction term that slows down
the growth rate for large populations and is known as the Verhulst
equation. It was introduced by Verhulst in 1838 as a model for the growth of
the human population. In general, for a given growth rate $r$ one wants to
determine the development of the population $p(t)$ in time for a given initial
population $p_0$ at time $t = t_0$. □


Both examples are typical initial value problems: Find a solution of a
differential equation that attains a given initial value at a given initial
time. This notion is made more precise by the following definition.

Definition 10.4 The initial value problem for the ordinary differential
equation
$$u' = f(x,u) \tag{10.2}$$

consists in finding a continuously differentiable solution $u$ satisfying the
initial condition

$$u(x_0) = u_0 \tag{10.3}$$
for a given initial point $x_0$ and a given initial value $u_0$.


The existence and uniqueness of a solution to such an initial value prob- 
lem are settled through the following fundamental theorem. 


Theorem 10.5 (Picard–Lindelöf) Let $G \subset \mathbb{R}^{n+1}$ be a domain and let
$f : G \to \mathbb{R}^n$ be a continuous function satisfying a Lipschitz condition

$$\|f(x,u) - f(x,v)\| \le L\,\|u - v\| \tag{10.4}$$

for all $(x,u), (x,v) \in G$ and some constant $L > 0$, which is called the
Lipschitz constant. Then for each initial data pair $(x_0,u_0) \in G$ there exists
an interval $[x_0 - a, x_0 + a]$ with $a > 0$ such that the initial value problem
(10.2)–(10.3) has a unique solution in this interval.


Proof. First, we transform the initial value problem equivalently into the
Volterra integral equation

$$u(x) = u_0 + \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi. \tag{10.5}$$

Clearly, if $u$ solves the initial value problem, then it follows by integrating
the differential equation and using the initial condition that $u$ also solves
the integral equation. Conversely, if $u$ is a continuous solution of the integral
equation, then by differentiating the integral equation it follows that $u$ is
continuously differentiable and satisfies the differential equation. Inserting
$x = x_0$ in (10.5) shows that the initial condition is fulfilled.

For solving the Volterra integral equation we now can employ Banach's
fixed point Theorem 3.45. Since $G$ is open, we can choose a bounded domain
$D$ such that $(x_0,u_0) \in D$ and $\overline{D} \subset G$. Denote by $M$ a bound on the
continuous function $f : D \to \mathbb{R}^n$; i.e.,

$$\|f(x,u)\| \le M, \quad (x,u) \in D.$$
Since $D$ is open, we can choose $a > 0$ such that the closed rectangle
$$B := \{(x,u) \in \mathbb{R}^{n+1} : |x - x_0| \le a,\ \|u - u_0\| \le Ma\}$$

is contained in $D$. Consider the Banach space $C[x_0 - a, x_0 + a]$ of continuous
functions $u : [x_0 - a, x_0 + a] \to \mathbb{R}^n$ furnished with the maximum norm

$$\|u\|_\infty = \max_{|x - x_0| \le a} \|u(x)\|$$
in terms of the chosen norm $\|\cdot\|$ on $\mathbb{R}^n$. Each solution $u$ of the integral
equation satisfies (see (6.1))

$$\|u(x) - u_0\| = \left\| \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi \right\| \le Ma, \quad |x - x_0| \le a,$$

that is,
$$\|u - u_0\|_\infty \le Ma,$$

which implies that the solution remains within the rectangle $B$. We consider
the closed subset

$$U := \{u \in C[x_0 - a, x_0 + a] : \|u - u_0\|_\infty \le Ma\}$$

of the Banach space $C[x_0 - a, x_0 + a]$ and note that by Remark 3.40 the
set $U$ is complete. On $U$ we define an operator $A : U \to U$ by setting

$$(Au)(x) := u_0 + \int_{x_0}^{x} f(\xi,u(\xi))\,d\xi, \quad |x - x_0| \le a.$$

The operator $A$ indeed maps $U$ into itself, since the function $Au$ is
continuous and satisfies $\|Au - u_0\|_\infty \le Ma$. With the aid of the Lipschitz
condition (10.4) and using (6.1) we can estimate

$$\begin{aligned} \|(Au)(x) - (Av)(x)\| &= \left\| \int_{x_0}^{x} \left[ f(\xi,u(\xi)) - f(\xi,v(\xi)) \right] d\xi \right\| \\ &\le L \left| \int_{x_0}^{x} \|u(\xi) - v(\xi)\|\,d\xi \right| \le La\,\|u - v\|_\infty \end{aligned}$$

for all $|x - x_0| \le a$. Hence
$$\|Au - Av\|_\infty \le La\,\|u - v\|_\infty$$

for all $u, v \in U$. Now we choose $a$ such that $a < 1/L$. Then $A : U \to U$ is a
contraction operator, and the Banach fixed point theorem ensures a unique
fixed point of $A$, i.e., a unique solution of the integral equation (10.5) in
the interval $[x_0 - a, x_0 + a]$. □


Exploiting the fact that in Theorem 10.5 the width $a$ of the interval
is determined by the Lipschitz constant $L$, which is independent of the
initial point $(x_0,u_0)$, one can assure global existence of the solution; i.e.,
the solution to the initial value problem exists and is unique until it leaves
the domain $G$ of definition for the differential equation.

Note that on a convex domain each function that is continuously
differentiable with respect to $u$ satisfies a Lipschitz condition (see the mean
value Theorem 6.7).

Corollary 10.6 Under the assumptions of Theorem 10.5, the sequence
$(u_\nu)$ defined by $u_0(x) = u_0$ and

$$u_{\nu+1}(x) := u_0 + \int_{x_0}^{x} f(\xi,u_\nu(\xi))\,d\xi, \quad |x - x_0| \le a, \quad \nu = 0,1,\ldots, \tag{10.6}$$

converges as $\nu \to \infty$ uniformly on $[x_0 - a, x_0 + a]$ to the unique solution $u$
of the initial value problem. We have the a posteriori error estimate

$$\|u - u_\nu\|_\infty \le \frac{La}{1 - La}\,\|u_\nu - u_{\nu-1}\|_\infty, \quad \nu = 1,2,\ldots.$$
Proof. This follows from Theorem 3.46. □


Example 10.7 Consider the initial value problem
$$u' = x^2 + u^2, \quad u(0) = 0,$$
on $G = (-0.5,0.5) \times (-0.5,0.5)$. For $f(x,u) := x^2 + u^2$ we have
$$|f(x,u)| \le 0.5$$

on $G$. Hence for any $a < 0.5$ and $M = 0.5$ the rectangle $B$ from the proof
of Theorem 10.5 satisfies $B \subset G$. Furthermore, we can estimate

$$|f(x,u) - f(x,v)| = |u^2 - v^2| = |(u + v)(u - v)| \le |u - v|$$
for all $(x,u), (x,v) \in G$; i.e., $f$ satisfies a Lipschitz condition with Lipschitz
constant $L = 1$. Thus in this case the contraction number in the
Picard–Lindelöf theorem is given by $La < 0.5$.
Here, the iteration (10.6) reads

$$u_{\nu+1}(x) = \int_0^x \left[ \xi^2 + u_\nu^2(\xi) \right] d\xi.$$
Starting with $u_0(x) = 0$ we first compute
$$u_1(x) = \int_0^x \xi^2\,d\xi = \frac{x^3}{3},$$
and from Corollary 10.6 we have the error estimate
$$\|u - u_1\|_\infty \le \|u_1 - u_0\|_\infty = \frac{1}{24} = 0.041\ldots.$$


The second iteration yields
$$u_2(x) = \int_0^x \left[ \xi^2 + \frac{\xi^6}{9} \right] d\xi = \frac{x^3}{3} + \frac{x^7}{63}$$
with the error estimate

$$\|u - u_2\|_\infty \le \|u_2 - u_1\|_\infty = \frac{1}{63 \cdot 2^7} = 0.00012\ldots,$$
and the third iteration yields

$$u_3(x) = \int_0^x \left[ \xi^2 + \frac{\xi^6}{9} + \frac{2\xi^{10}}{189} + \frac{\xi^{14}}{3969} \right] d\xi = \frac{x^3}{3} + \frac{x^7}{63} + \frac{2x^{11}}{2079} + \frac{x^{15}}{59535}$$
with the error estimate

$$\|u - u_3\|_\infty \le \|u_3 - u_2\|_\infty = \frac{1}{2079 \cdot 2^{10}} + \frac{1}{59535 \cdot 2^{15}} = 0.00000047\ldots.$$

In this example three steps of the Picard–Lindelöf iteration give eight
decimal places of accuracy. However, the example is not typical, since in
general, the integrations required in each iteration step will not be available
explicitly as in the present case. □
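
For this particular right-hand side every iterate is a polynomial, so the iteration (10.6) can be carried out exactly with rational arithmetic. The sketch below is illustrative only: polynomials are stored as coefficient lists (index = power of $x$), and the helper names `poly_mul`, `poly_add`, `poly_integrate`, `picard_step`, and `poly_eval` are assumptions made for this example.

```python
from fractions import Fraction


def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            r[i + j] += a * b
    return r


def poly_add(p, q):
    """Sum of two polynomials given as coefficient lists."""
    n = max(len(p), len(q))
    p = p + [Fraction(0)] * (n - len(p))
    q = q + [Fraction(0)] * (n - len(q))
    return [a + b for a, b in zip(p, q)]


def poly_integrate(p):
    """Antiderivative that vanishes at x = 0."""
    return [Fraction(0)] + [c / (i + 1) for i, c in enumerate(p)]


def picard_step(u):
    """One step of (10.6) for u' = x^2 + u^2, u(0) = 0."""
    x_squared = [Fraction(0), Fraction(0), Fraction(1)]
    return poly_integrate(poly_add(x_squared, poly_mul(u, u)))


def poly_eval(p, x):
    """Evaluate the polynomial at x."""
    return sum(c * x ** i for i, c in enumerate(p))


if __name__ == "__main__":
    u = [Fraction(0)]                      # u_0(x) = 0
    for nu in range(1, 4):
        u_next = picard_step(u)
        # max over [-1/2, 1/2] of |u_nu - u_{nu-1}| is attained at x = 1/2,
        # since the difference has only odd powers with positive coefficients
        half = Fraction(1, 2)
        print(nu, float(poly_eval(u_next, half) - poly_eval(u, half)))
        u = u_next
```

The printed differences are the quantities $\|u_\nu - u_{\nu-1}\|_\infty$ that enter the a posteriori estimates of the example.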


10.2 Euler's Method


In the sequel we confine our presentation to the initial value problem for a
differential equation of the first order. The generalization to systems and
hence to equations of higher order is straightforward. We shall always
tacitly assume that the assumptions of the Picard–Lindelöf Theorem 10.5
are satisfied.
The following simple method for the numerical solution of the initial
value problem
$$u' = f(x,u), \quad u(x_0) = u_0, \tag{10.7}$$

was first used by Euler. Given a step size $h > 0$, it consists in replacing the
derivative $u' = f(x,u)$ throughout the interval $[x_0, x_0 + h]$ by the derivative
$u_0' = f(x_0,u_0)$ at the initial point, i.e., geometrically speaking, by replacing
the solution by its tangent line at the initial point $x_0$. This leads to the
approximation

$$u_1 = u_0 + h f(x_0,u_0) \tag{10.8}$$

for the value $u(x_1)$ of the exact solution at the point $x_1 = x_0 + h$. Repeating
this procedure leads to the Euler method as described in the following
definition. For obvious reasons, this method is also known as the polygon
method, since it approximates the exact solution curve by a polygon.


Definition 10.8 The Euler method for the numerical solution of the
initial value problem (10.7) constructs approximations $u_j$ to the exact solution
$u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\ldots,$$
with step size $h$ by
$$u_{j+1} := u_j + h f(x_j,u_j), \quad j = 0,1,\ldots.$$
Example 10.9 Consider the initial value problem
$$u' = x^2 + u^2, \quad u(0) = 0,$$

from Example 10.7. Table 10.1 gives the difference between the exact
solution as computed by the Picard–Lindelöf iterations in Example 10.7 and
the approximate solution obtained by Euler's method for various step sizes
$h$. We observe a linear convergence as $h \to 0$. □
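
A minimal sketch of the Euler method applied to this example follows. The function names `euler` and `reference` are illustrative; as a stand-in for the exact solution the sketch uses the third Picard iterate $u_3$ from Example 10.7, which is an assumption that is accurate only to about $10^{-6}$ on $[0, 0.5]$.

```python
def euler(f, x0, u0, h, steps):
    """Explicit Euler method: u_{j+1} = u_j + h f(x_j, u_j)."""
    x, u = x0, u0
    for j in range(steps):
        u = u + h * f(x, u)
        x = x0 + (j + 1) * h
    return u


def reference(x):
    """Third Picard iterate u_3 of Example 10.7 (stand-in for the exact solution)."""
    return x ** 3 / 3 + x ** 7 / 63 + 2 * x ** 11 / 2079 + x ** 15 / 59535


def f(x, u):
    return x * x + u * u


if __name__ == "__main__":
    for steps in (5, 50, 500):
        h = 0.5 / steps
        print(h, reference(0.5) - euler(f, 0.0, 0.0, h, steps))
```

Reducing $h$ by a factor of ten reduces the error at $x = 0.5$ by roughly the same factor, which is the linear convergence noted in the example.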


TABLE 10.1. Numerical example for the Euler method

  x  | h = 0.1  | h = 0.01 | h = 0.001 | h = 0.0001
 0.1 | 0.000333 | 0.000048 | 0.000005  | 0.000000
 0.2 | 0.001667 | 0.000197 | 0.000020  | 0.000002
 0.3 | 0.004003 | 0.000446 | 0.000045  | 0.000005
 0.4 | 0.007357 | 0.000798 | 0.000080  | 0.000008
 0.5 | 0.011769 | 0.001258 | 0.000127  | 0.000013

There are three different interpretations of the approximation formula of
Euler's method:
1. Replace the derivative by the difference quotient

$$\frac{u(x_1) - u(x_0)}{h} \approx u'(x_0) = f(x_0,u_0)$$

and solve for $u(x_1)$.
2. Integrate in the equivalent integral equation (10.5), i.e., in

$$u(x_1) = u(x_0) + \int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi,$$
approximately by the rectangular rule

$$\int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi \approx h f(x_0,u_0).$$

3. Use Taylor's formula

$$u(x_1) = u(x_0) + h u'(x_0) + \frac{h^2}{2}\,u''(x_0 + \theta h)$$

with $0 < \theta < 1$ and neglect the remainder term; i.e., approximate
$u(x_1) \approx u(x_0) + h u'(x_0)$.

Each of these three interpretations opens up possibilities for
improvements of Euler's method. For example, instead of the rectangular rule we
can use the more accurate trapezoidal rule

$$\int_{x_0}^{x_1} f(\xi,u(\xi))\,d\xi \approx \frac{h}{2}\left[ f(x_0,u(x_0)) + f(x_1,u(x_1)) \right],$$

which yields
$$u_1 = u_0 + \frac{h}{2}\left[ f(x_0,u_0) + f(x_1,u_1) \right]. \tag{10.9}$$
Repeating this procedure leads to the following method.


Definition 10.10 The implicit Euler method for the numerical solution of
the initial value problem (10.7) constructs approximations $u_j$ to the exact
solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\ldots,$$

with step size $h$ by

$$u_{j+1} = u_j + \frac{h}{2}\left[ f(x_j,u_j) + f(x_{j+1},u_{j+1}) \right], \quad j = 0,1,\ldots.$$

This method is called an implicit method, since determining $u_{j+1}$ requires
the solution of an equation that in general is nonlinear. In contrast, the
Euler method of Definition 10.8 is an explicit method, since it provides an
explicit expression for the computation of $u_{j+1}$.


Remark 10.11 The nonlinear equations of the implicit Euler method can
be solved by successive approximations, provided that the Lipschitz constant
$L$ for $f$ and the step size $h$ satisfy $Lh < 2$.


Proof. We have to solve equation (10.9) for $u_1$. Setting

$$g(u) := u_0 + \frac{h}{2}\left[ f(x_0,u_0) + f(x_1,u) \right]$$

we can rewrite (10.9) as the fixed point equation $u_1 = g(u_1)$. The function
$g$ is a contraction, since

$$|g(u) - g(v)| = \frac{h}{2}\,|f(x_1,u) - f(x_1,v)| \le \frac{hL}{2}\,|u - v|,$$
and therefore the assertion follows from Theorem 3.46. □


Since the solution of the nonlinear equation (10.9) will deliver only an
approximation to the solution of the initial value problem, there is no need
to solve (10.9) with high accuracy. Using the approximate value from the
explicit Euler method as a starting point and carrying out only one
iteration, we arrive at the following method.

Definition 10.12 The predictor-corrector method for the Euler method
for the numerical solution of the initial value problem (10.7), also known
as the improved Euler method or Heun method, constructs approximations
$u_j$ to the exact solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1,2,\ldots,$$
by

$$u_{j+1} := u_j + \frac{h}{2}\left[ f(x_j,u_j) + f(x_{j+1},\, u_j + h f(x_j,u_j)) \right], \quad j = 0,1,\ldots.$$


Example 10.13 Consider again the initial value problem from Example
10.7. Table 10.2 gives the difference between the exact solution as computed
by the Picard–Lindelöf iterations and the approximate solution obtained by
the improved Euler method for various step sizes $h$. We observe quadratic
convergence as $h \to 0$. □


TABLE 10.2. Numerical example for the improved Euler method

  x  |   h = 0.1   |  h = 0.01   |  h = 0.001
 0.1 | −0.00016667 | −0.00000167 | −0.00000002
 0.2 | −0.00033326 | −0.00000333 | −0.00000003
 0.3 | −0.00049955 | −0.00000500 | −0.00000005
 0.4 | −0.00066530 | −0.00000668 | −0.00000007
 0.5 | −0.00083027 | −0.00000837 | −0.00000009


In the following section we will show that the Euler method and the 
improved Euler method are convergent with convergence order one and 
two, respectively, as observed in the special cases of Examples 10.9 and 
10.13. 


10.3 Single-Step Methods 


We generalize the Euler methods into more general single-step methods by 
the following definition. 


Definition 10.14 Single-step methods for the approximate solution of the
initial value problem
$$u' = f(x,u), \quad u(x_0) = u_0,$$

construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant
grid points
$$x_j := x_0 + jh, \quad j = 1,2,\ldots,$$
with step size $h$ by
$$u_{j+1} := u_j + h\,\varphi(x_j,u_j;h), \quad j = 0,1,\ldots,$$

where the function $\varphi : G \times (0,\infty) \to \mathbb{R}$ is given in terms of the right-hand
side $f : G \to \mathbb{R}$ of the differential equation.


Example 10.15 The Euler method and the improved Euler method are single-step methods with

$$\varphi(x,u;h) = f(x,u) \tag{10.10}$$

and

$$\varphi(x,u;h) = \frac{1}{2}\,\bigl[f(x,u) + f(x+h, u+hf(x,u))\bigr], \tag{10.11}$$

respectively. □


The function $\varphi$ describes how the differential equation

$$u' = f(x,u)$$

is approximated by the difference equation

$$\frac{1}{h}\,[u(x+h) - u(x)] = \varphi(x,u;h).$$

From a reasonable approximation we expect that the exact solution to the initial value problem approximately satisfies the difference equation. Hence,

$$\frac{1}{h}\,[u(x+h) - u(x)] - \varphi(x,u;h) \to 0, \quad h \to 0,$$

must be fulfilled for the exact solution $u$. We also expect that the order of this convergence will influence the accuracy of the approximate solution. These considerations are made more precise by the following definition.


Definition 10.16 For each $(x,u) \in G$ denote by $\eta = \eta(\xi)$ the unique solution to the initial value problem

$$\eta' = f(\xi,\eta), \quad \eta(x) = u,$$

with initial data $(x,u)$. Then

$$\Delta(x,u;h) := \frac{1}{h}\,[\eta(x+h) - \eta(x)] - \varphi(x,u;h)$$

is called the local discretization error. The single-step method is called consistent (with the initial value problem) if

$$\lim_{h\to 0}\Delta(x,u;h) = 0$$

uniformly for all $(x,u) \in G$, and it is said to have consistency order $p$ if

$$|\Delta(x,u;h)| \le Kh^p$$

for all $(x,u) \in G$, all $h > 0$, and some constant $K$.
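The local discretization error can be evaluated directly whenever $\eta$ is known in closed form. The following sketch does this for the hypothetical right-hand side $f(x,u) = u$, for which $\eta(x+h) = u\,e^{h}$; the names `local_error` and `observed_order` are illustrative choices, not notation from the text.

```python
import math

def f(x, u):
    return u  # hypothetical right-hand side u' = u, chosen because eta is known exactly

def phi_euler(x, u, h):
    return f(x, u)                                        # increment function (10.10)

def phi_heun(x, u, h):
    return 0.5 * (f(x, u) + f(x + h, u + h * f(x, u)))    # increment function (10.11)

def local_error(phi, x, u, h):
    """Delta(x, u; h) for u' = u, where eta(x + h) = u * exp(h)."""
    eta_next = u * math.exp(h)
    return (eta_next - u) / h - phi(x, u, h)

def observed_order(phi, h=1e-2):
    """Estimate p from the ratio of the local errors for h and h/2."""
    d1 = abs(local_error(phi, 0.0, 1.0, h))
    d2 = abs(local_error(phi, 0.0, 1.0, h / 2))
    return math.log2(d1 / d2)

order_euler = observed_order(phi_euler)   # close to 1
order_heun = observed_order(phi_heun)     # close to 2
```

For this example the observed exponents should be close to the consistency orders one and two of the two methods.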
Without loss of generality, in the sequel we always will assume that f
(and later also derivatives of f) are uniformly continuous and bounded on 
G. This can always be achieved by reducing G to a smaller domain. 


Theorem 10.17 A single-step method is consistent if and only if

$$\lim_{h\to 0}\varphi(x,u;h) = f(x,u)$$

uniformly for all $(x,u) \in G$.


Proof. Since we assume $f$ to be bounded, we have

$$\eta(x+t) - \eta(x) = \int_0^t \eta'(x+s)\,ds = \int_0^t f(x+s,\eta(x+s))\,ds \to 0, \quad t \to 0,$$

uniformly for all $(x,u) \in G$. Therefore, since we also assume that $f$ is uniformly continuous, it follows that

$$\left|\frac{1}{h}\int_0^h [\eta'(x+t) - \eta'(x)]\,dt\right| \le \max_{0\le t\le h}|\eta'(x+t) - \eta'(x)| = \max_{0\le t\le h}|f(x+t,\eta(x+t)) - f(x,\eta(x))| \to 0, \quad h \to 0,$$

uniformly for all $(x,u) \in G$. From this we obtain that

$$\Delta(x,u;h) + \varphi(x,u;h) - f(x,u) = \frac{1}{h}\,[\eta(x+h) - \eta(x)] - \eta'(x) = \frac{1}{h}\int_0^h [\eta'(x+t) - \eta'(x)]\,dt \to 0, \quad h \to 0,$$

uniformly for all $(x,u) \in G$. This now implies that the two conditions $\Delta \to 0$, $h \to 0$, and $\varphi \to f$, $h \to 0$, are equivalent. □


Theorem 10.18 The Euler method is consistent. If $f$ is continuously differentiable in $G$, then the Euler method has consistency order one.

Proof. Consistency is a consequence of Theorem 10.17 and the fact that $\varphi(x,u;h) = f(x,u)$ for Euler's method. If $f$ is continuously differentiable, then from the differential equation $\eta' = f(\xi,\eta)$ it follows that $\eta$ is twice continuously differentiable with

$$\eta'' = f_x(\xi,\eta) + f_u(\xi,\eta)\,f(\xi,\eta). \tag{10.12}$$

Therefore, Taylor's formula yields

$$|\Delta(x,u;h)| = \left|\frac{1}{h}\,[\eta(x+h) - \eta(x)] - \eta'(x)\right| = \frac{h}{2}\,|\eta''(x+\theta h)| \le Kh$$

for some $0 < \theta < 1$ and some bound $K$ for the function $\frac{1}{2}(f_x + f_u f)$. □
Theorem 10.19 The improved Euler method is consistent. If $f$ is twice continuously differentiable in $G$, then the improved Euler method has consistency order two.

Proof. Consistency follows from Theorem 10.17 and

$$\varphi(x,u;h) = \frac{1}{2}\,\bigl[f(x,u) + f(x+h, u+hf(x,u))\bigr] \to f(x,u), \quad h \to 0.$$

If $f$ is twice continuously differentiable, then (10.12) implies that $\eta$ is three times continuously differentiable with

$$\eta''' = f_{xx}(\xi,\eta) + 2f_{xu}(\xi,\eta)f(\xi,\eta) + f_{uu}(\xi,\eta)f^2(\xi,\eta) + f_u(\xi,\eta)f_x(\xi,\eta) + f_u^2(\xi,\eta)f(\xi,\eta).$$

Hence Taylor's formula yields

$$\left|\eta(x+h) - \eta(x) - h\eta'(x) - \frac{h^2}{2}\,\eta''(x)\right| = \frac{h^3}{6}\,|\eta'''(x+\theta h)| \le K_1h^3 \tag{10.13}$$

for some $0 < \theta < 1$ and a bound $K_1$ for $\frac{1}{6}(f_{xx} + 2f_{xu}f + f_{uu}f^2 + f_uf_x + f_u^2f)$. From Taylor's formula for functions of two variables we have the estimate

$$|f(x+h,u+k) - f(x,u) - hf_x(x,u) - kf_u(x,u)| \le \frac{1}{2}\,K_2(|h| + |k|)^2$$

with a bound $K_2$ for the second derivatives $f_{xx}$, $f_{xu}$, and $f_{uu}$. From this, setting $k = hf(x,u)$, in view of (10.12) we obtain

$$|f(x+h,u+hf(x,u)) - f(x,u) - h\eta''(x)| \le \frac{1}{2}\,K_2(1+K_0)^2h^2$$

with some bound $K_0$ for $f$, whence

$$\left|\varphi(x,u;h) - f(x,u) - \frac{h}{2}\,\eta''(x)\right| \le \frac{1}{4}\,K_2(1+K_0)^2h^2 \tag{10.14}$$

follows. Now combining (10.13) and (10.14), with the aid of the triangle inequality and using the differential equation, we can establish consistency order two. □


We proceed by investigating the convergence of single-step methods as the step size $h$ tends to zero. This is done for the solution to the initial value problem in a fixed interval $[a,b]$ with initial data at $x_0 = a$ and the step size $h$ and the number $n$ of steps chosen such that $x_n = b$.


Definition 10.20 Assume that in the interval $[a,b]$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 0, 1, \ldots, n,$$

with $x_0 = a$ and $x_n = b$, approximate values $u_j$ for the solution $u(x_j)$ to the initial value problem

$$u' = f(x,u), \quad u(x_0) = u_0,$$

are obtained by a single-step method. Then

$$e_j = e_j(h) := u_j - u(x_j), \quad j = 0, 1, \ldots, n,$$

is called the global error, and

$$E = E(h) := \max_{j=0,\ldots,n}|e_j(h)|$$

is called the maximal global error. The single-step method is called convergent if

$$\lim_{h\to 0}E(h) = 0,$$

and it is said to have convergence order $p$ if

$$E(h) \le Hh^p$$

for all $h > 0$ and some constant $H$.
The following lemma is needed for our convergence analysis. 


Lemma 10.21 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

$$|\xi_{j+1}| \le (1+A)|\xi_j| + B, \quad j = 0, 1, \ldots,$$

for some constants $A > 0$ and $B \ge 0$. Then the estimate

$$|\xi_j| \le |\xi_0|\,e^{jA} + \frac{B}{A}\,\bigl(e^{jA} - 1\bigr), \quad j = 0, 1, \ldots,$$

holds.

Proof. We prove this by induction. The estimate is true for $j = 0$. Assume that it has been proven for some $j \ge 0$. Then, with the aid of the inequality $1 + A \le e^A$, which follows from the power series for the exponential function, we obtain

$$|\xi_{j+1}| \le (1+A)|\xi_0|\,e^{jA} + (1+A)\,\frac{B}{A}\,\bigl(e^{jA} - 1\bigr) + B \le |\xi_0|\,e^{(j+1)A} + \frac{B}{A}\,\bigl(e^{(j+1)A} - 1\bigr);$$

i.e., the estimate also holds for $j + 1$. □
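As a quick plausibility check of Lemma 10.21, the following sketch generates the extreme sequence in which the hypothesis holds with equality and compares it with the stated bound. The numerical values of $A$, $B$, and $\xi_0$ are arbitrary assumptions chosen for illustration.

```python
import math

def bound(xi0, A, B, j):
    """Right-hand side of the estimate in Lemma 10.21."""
    return abs(xi0) * math.exp(j * A) + (B / A) * (math.exp(j * A) - 1.0)

A, B = 0.05, 0.01      # hypothetical constants with A > 0 and B >= 0
xi0 = 0.3              # hypothetical starting value
xi = xi0
holds = True
for j in range(200):
    holds = holds and abs(xi) <= bound(xi0, A, B, j)
    xi = (1.0 + A) * xi + B    # extreme case: equality in the hypothesis of the lemma
```

The flag `holds` records whether the estimate was satisfied for every index that was checked.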
Theorem 10.22 Assume that the function $\varphi$ describing the single-step method is continuous (also with respect to $h$) and satisfies a Lipschitz condition; i.e.,

$$|\varphi(x,u;h) - \varphi(x,v;h)| \le M|u - v|$$

for all $(x,u), (x,v) \in G$, all (sufficiently small) $h$, and a Lipschitz constant $M$. Then the single-step method is convergent if and only if it is consistent.


Proof. We first show that consistency implies convergence and assume that the single-step method is consistent. For the difference of two consecutive errors we compute

$$e_{j+1} - e_j = [u_{j+1} - u_j] - [u(x_{j+1}) - u(x_j)] = h\varphi(x_j,u_j;h) - [u(x_{j+1}) - u(x_j)]$$

$$= h\,[\varphi(x_j,u_j;h) - \varphi(x_j,u(x_j);h) - \Delta(x_j,u(x_j);h)].$$

Hence

$$|e_{j+1} - e_j| \le h\,[M|u_j - u(x_j)| + c(h)], \tag{10.15}$$

where

$$c(h) := \max_{a\le x\le b}|\Delta(x,u(x);h)|$$

satisfies

$$c(h) \to 0, \quad h \to 0,$$

since we assume consistency. The inequality (10.15) implies that

$$|e_{j+1}| \le (1+hM)|e_j| + hc(h), \quad j = 0, 1, \ldots, n-1.$$

From this, applying Lemma 10.21 for $A = hM$ and $B = hc(h)$ and using $e_0 = 0$, we obtain the estimate

$$|e_j| \le \frac{c(h)}{M}\,\bigl(e^{M(x_j - x_0)} - 1\bigr), \quad j = 0, 1, \ldots, n. \tag{10.16}$$

This establishes the convergence

$$E(h) \le \frac{c(h)}{M}\,\bigl(e^{M(b-a)} - 1\bigr) \to 0, \quad h \to 0.$$

We now show that convergence implies consistency and assume that the single-step method is convergent; i.e., for $h \to 0$ the approximations

$$u_{j+1} := u_j + h\varphi(x_j,u_j;h) \tag{10.17}$$

converge to the solution of

$$u' = f(x,u), \quad u(x_0) = u_0,$$

for all initial data $(x_0,u_0) \in G$. We set

$$g(x,u) := \varphi(x,u;0)$$

and observe that by Theorem 10.17 the single-step method is also consistent with the initial value problem

$$u' = g(x,u), \quad u(x_0) = u_0. \tag{10.18}$$

Since we have already shown that consistency implies convergence, the approximations (10.17) also converge to the solution of (10.18); i.e., the solutions of the two initial value problems coincide. Therefore, we have $f(x_0,u_0) = g(x_0,u_0)$, and since this holds for all $(x_0,u_0) \in G$, from the continuity of $\varphi$ we conclude uniform convergence:

$$\varphi(x,u;h) \to f(x,u), \quad h \to 0.$$

Now consistency follows from Theorem 10.17. □


Theorem 10.23 Assume that the single-step method satisfies the assumptions of the previous Theorem 10.22 and that it has consistency order $p$; i.e., $|\Delta(x,u;h)| \le Kh^p$. Then

$$|e_j| \le \frac{K}{M}\,\bigl(e^{M(x_j - x_0)} - 1\bigr)\,h^p, \quad j = 0, 1, \ldots, n;$$

i.e., the convergence also has order $p$.

Proof. This follows from (10.16) with the aid of $c(h) \le Kh^p$. □
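The bound of Theorem 10.23 can be compared with the actual error in a simple case. The sketch below uses the Euler method for the hypothetical problem $u' = u$, $u(0) = 1$ on $[0,1]$, where $M = 1$ is a Lipschitz constant and, along the exact solution, $c(h) \le \frac{h}{2}\max|u''| = \frac{h}{2}e$, so that $K = e/2$ and $p = 1$ may be used in (10.16). These constants are assumptions specific to this example.

```python
import math

def euler(f, x0, u0, h, n):
    """Return [u_0, ..., u_n] computed by the explicit Euler method."""
    us = [u0]
    x, u = x0, u0
    for _ in range(n):
        u = u + h * f(x, u)
        x = x + h
        us.append(u)
    return us

# Hypothetical test problem: u' = u on [0, 1], u(0) = 1, exact solution e^x.
M, K, p = 1.0, math.e / 2.0, 1     # constants valid for this example only
n = 10
h = 1.0 / n
us = euler(lambda x, u: u, 0.0, 1.0, h, n)

errors = [abs(us[j] - math.exp(j * h)) for j in range(n + 1)]
bounds = [K / M * (math.exp(M * j * h) - 1.0) * h ** p for j in range(n + 1)]
bound_holds = all(e <= b + 1e-15 for e, b in zip(errors, bounds))
```

In this example the bound overestimates the error at $x = 1$ by roughly a factor of two, which illustrates that the estimate is reliable but not sharp.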


Corollary 10.24 The Euler method and the improved Euler method are 
convergent. For continuously differentiable f the Euler method has conver- 
gence order one. For twice continuously differentiable f the improved Euler 
method has convergence order two. 


Proof. By Theorems 10.18, 10.19, 10.22, and 10.23 it remains only to verify the Lipschitz condition of the function $\varphi$ for the improved Euler method given by (10.11). From the Lipschitz condition for $f$ we obtain

$$|\varphi(x,u;h) - \varphi(x,v;h)| \le \frac{1}{2}\,|f(x,u) - f(x,v)| + \frac{1}{2}\,|f(x+h,u+hf(x,u)) - f(x+h,v+hf(x,v))|$$

$$\le \frac{L}{2}\,|u - v| + \frac{L}{2}\,\bigl|[u + hf(x,u)] - [v + hf(x,v)]\bigr| \le L\left(1 + \frac{hL}{2}\right)|u - v|;$$

i.e., $\varphi$ also satisfies a Lipschitz condition. □
Single-step methods of higher order can be constructed as follows. For a set of real numbers $s_\ell$, $\ell = 2, \ldots, m$, $c_{\ell i}$, $i = 1, \ldots, \ell-1$, $\ell = 2, \ldots, m$, and $a_\ell$, $\ell = 1, \ldots, m$, the quantities

$$k_1 = f(x_j,u_j),$$

$$k_2 = f(x_j + s_2h, u_j + c_{21}k_1h),$$

$$k_3 = f(x_j + s_3h, u_j + c_{31}k_1h + c_{32}k_2h),$$

$$\vdots$$

$$k_m = f\Bigl(x_j + s_mh, u_j + h\sum_{i=1}^{m-1}c_{mi}k_i\Bigr)$$

are computed recursively, and then the approximation is obtained by

$$u_{j+1} = u_j + h\sum_{i=1}^{m}a_ik_i.$$

The Euler method is described by $m = 1$ and $a_1 = 1$ and the improved Euler method by $m = 2$, $s_2 = 1$, $c_{21} = 1$, and $a_1 = a_2 = 1/2$. The basic goal in the design of higher-order methods is, for a given $m$, to determine the coefficients such that the order of consistency and hence the order of convergence becomes as large as possible. As an example, we shall consider the Runge-Kutta method, which is the most widely used and most successful single-step method. It was introduced by Runge in 1895 for a single differential equation and extended to systems of differential equations by Kutta in 1901.
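The general $m$-stage construction described above can be written as one small routine that takes the coefficients as data. The sketch below is an illustration under stated assumptions: the function name `explicit_rk_step`, the 0-based list layout of the coefficients, and the right-hand side $f(x,u) = x + u$ are choices made here, not part of the text.

```python
def explicit_rk_step(f, x, u, h, s, c, a):
    """One step of the m-stage scheme with nodes s, coefficients c, and weights a.

    All lists are 0-based: s[l] belongs to stage l+1 (s[0] = 0 for k_1),
    c[l][i] plays the role of c_{l+1,i+1}, and a[i] the role of a_{i+1}.
    """
    k = []
    for l in range(len(a)):
        incr = sum(c[l][i] * k[i] for i in range(l))   # sum over earlier stages
        k.append(f(x + s[l] * h, u + h * incr))
    return u + h * sum(ai * ki for ai, ki in zip(a, k))

# Coefficients quoted in the text for the Euler and the improved Euler method.
EULER = dict(s=[0.0], c=[[]], a=[1.0])
HEUN = dict(s=[0.0, 1.0], c=[[], [1.0]], a=[0.5, 0.5])

f = lambda x, u: x + u          # hypothetical right-hand side for illustration
u_euler = explicit_rk_step(f, 0.0, 1.0, 0.1, **EULER)
u_heun = explicit_rk_step(f, 0.0, 1.0, 0.1, **HEUN)
```

Keeping the coefficients separate from the stepping routine makes it easy to try other coefficient sets, such as those of the third-order methods in the problems.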


Definition 10.25 The Runge-Kutta method for the numerical solution of the initial value problem (10.7) constructs approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \ldots,$$

with step size $h$ by using the above higher-order method with

$$k_1 = f(x_j,u_j),$$

$$k_2 = f\Bigl(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,k_1\Bigr),$$

$$k_3 = f\Bigl(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,k_2\Bigr),$$

$$k_4 = f(x_j + h, u_j + hk_3),$$

and

$$u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 2k_2 + 2k_3 + k_4).$$


For the differential equation u' = f(x) the Runge-Kutta method coin- 
cides with Simpson’s rule for numerical integration. 
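A minimal sketch of the Runge-Kutta method of Definition 10.25 follows. It also checks the remark on Simpson's rule for one step and estimates the convergence order numerically. The integrand $\cos x$ and the test problem $u' = u$, $u(0) = 1$ are assumptions made for illustration.

```python
import math

def rk4_step(f, x, u, h):
    """One step of the classical Runge-Kutta method."""
    k1 = f(x, u)
    k2 = f(x + h / 2, u + h / 2 * k1)
    k3 = f(x + h / 2, u + h / 2 * k2)
    k4 = f(x + h, u + h * k3)
    return u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def rk4(f, x0, u0, h, n):
    """Approximation u_n after n steps of size h."""
    x, u = x0, u0
    for _ in range(n):
        u = rk4_step(f, x, u, h)
        x += h
    return u

# For u' = g(x), one step reproduces Simpson's rule on [x, x + h].
g = math.cos                                   # hypothetical integrand
one_step = rk4_step(lambda x, u: g(x), 0.0, 0.0, 0.5)
simpson = 0.5 / 6 * (g(0.0) + 4 * g(0.25) + g(0.5))

# Observed order on the hypothetical test problem u' = u, u(0) = 1.
def err(n):
    return abs(rk4(lambda x, u: u, 0.0, 1.0, 1.0 / n, n) - math.e)

order = math.log2(err(10) / err(20))           # close to 4
```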


Theorem 10.26 The Runge-Kutta method is consistent. If f is four-times 
continuously differentiable, then it has consistency order four and hence 
convergence order four. 


Proof. The function $\varphi$ describing the Runge-Kutta method is given recursively by

$$\varphi = \frac{1}{6}\,(\varphi_1 + 2\varphi_2 + 2\varphi_3 + \varphi_4),$$

where

$$\varphi_1(x,u;h) = f(x,u),$$

$$\varphi_2(x,u;h) = f\Bigl(x + \frac{h}{2}, u + \frac{h}{2}\,\varphi_1(x,u;h)\Bigr),$$

$$\varphi_3(x,u;h) = f\Bigl(x + \frac{h}{2}, u + \frac{h}{2}\,\varphi_2(x,u;h)\Bigr),$$

$$\varphi_4(x,u;h) = f(x + h, u + h\varphi_3(x,u;h)).$$

From this, consistency follows immediately by Theorem 10.17.

Analogously to the proof of Theorem 10.19 for the improved Euler method, the consistency order four can be established by a Taylor expansion of $\varphi(x,u;h)$ with respect to powers of $h$ up to order $h^3$ and expressing the derivatives of $\eta$ on the right-hand side of

$$\frac{1}{h}\,[\eta(x+h) - \eta(x)] = \eta'(x) + \frac{h}{2}\,\eta''(x) + \frac{h^2}{6}\,\eta'''(x) + \frac{h^3}{24}\,\eta''''(x) + O(h^4)$$

through $f$ and its derivatives by using the differential equation. We leave the details as an exercise for the reader (see Problem 10.9). □


The error estimate in Theorem 10.23 is not practical in general, since the constants $M$ and $K$ have to be determined from higher-order derivatives of $f$. Therefore, in practice, the error is estimated by the following heuristic consideration. For convergence order $p$, the error between the approximate solution $u(x;h)$ at the point $x$, obtained with step size $h$, and the exact solution $u(x)$ satisfies

$$u(x;h) - u(x) \approx ch^p$$

for some constant $c$. Correspondingly, for step size $h/2$ we have that

$$u\Bigl(x;\frac{h}{2}\Bigr) - u(x) \approx c\Bigl(\frac{h}{2}\Bigr)^p.$$

Subtracting these relations yields

$$u(x;h) - u\Bigl(x;\frac{h}{2}\Bigr) \approx c\Bigl(\frac{h}{2}\Bigr)^p(2^p - 1).$$

Now the constant $c$ can be eliminated from the last two relations, with the result that

$$u\Bigl(x;\frac{h}{2}\Bigr) - u(x) \approx \frac{1}{2^p - 1}\left[u(x;h) - u\Bigl(x;\frac{h}{2}\Bigr)\right]. \tag{10.19}$$

Hence we may consider (10.19) as an estimate for the error occurring with the smaller step size $h/2$. However, we need to keep in mind that (10.19) does not provide an exact bound and might fail in particular situations. Nevertheless, it can be used for controlling the step size during the course of the numerical calculations in order to adjust the actual step size according to the required accuracy.

Solving for $u(x)$ in (10.19) yields

$$u(x) \approx \frac{2^p\,u\bigl(x;\frac{h}{2}\bigr) - u(x;h)}{2^p - 1}. \tag{10.20}$$

We leave it as an exercise for the reader to interpret (10.20) as a Richardson extrapolation, which we explained in detail for the case of numerical integration in Section 9.5.
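The heuristic error estimate (10.19) and the extrapolated value (10.20) are easy to try out. The sketch below uses the Euler method ($p = 1$) on the hypothetical test problem $u' = -u$, $u(0) = 1$ at $x = 1$; the helper name `euler_solve` and the step counts are assumptions for illustration.

```python
import math

def euler_solve(f, x0, u0, x_end, n):
    """Euler approximation of u(x_end) using n equal steps."""
    h = (x_end - x0) / n
    x, u = x0, u0
    for _ in range(n):
        u += h * f(x, u)
        x += h
    return u

p = 1                                     # convergence order of the Euler method
f = lambda x, u: -u                       # hypothetical test problem with u(0) = 1
exact = math.exp(-1.0)

u_h = euler_solve(f, 0.0, 1.0, 1.0, 100)      # step size h = 0.01
u_h2 = euler_solve(f, 0.0, 1.0, 1.0, 200)     # step size h/2

estimate = (u_h - u_h2) / (2 ** p - 1)        # right-hand side of (10.19)
actual = u_h2 - exact                         # left-hand side of (10.19)
extrapolated = (2 ** p * u_h2 - u_h) / (2 ** p - 1)   # formula (10.20)
```

For this smooth example the estimate agrees with the actual error to within a few percent, and the extrapolated value is considerably more accurate than either approximation; as the text cautions, this cannot be relied upon in every situation.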


10.4 Multistep Methods 


In the single-step methods each computed function value of f is used only 
in one step. It is natural to try to design methods where each computed 
function value of f is used in several steps. This leads to multistep methods, 
as described in the following definition. 


Definition 10.27 Multistep methods for the approximate solution of the initial value problem

$$u' = f(x,u), \quad u(x_0) = u_0,$$

construct approximations $u_j$ to the exact solution $u(x_j)$ at the equidistant grid points

$$x_j := x_0 + jh, \quad j = 1, 2, \ldots,$$

with step size $h$ by

$$u_{j+r} + a_{r-1}u_{j+r-1} + \cdots + a_0u_j = h\varphi(x_j,u_j,\ldots,u_{j+r-1};h)$$

for $j = 0, 1, \ldots$. Here $\varphi$ is a function of $r + 2$ variables given in terms of $f$, and $a_0, \ldots, a_{r-1}$ are constants.




To start such a multistep method involving $r$ steps, $r$ starting values $u_0, u_1, \ldots, u_{r-1}$ are required. For example, these can be approximately computed from the initial value $u_0$ by a single-step method such as the Runge-Kutta method.

A particular class of multistep methods is obtained by approximating the integral in

$$u(x_{j+r}) - u(x_{j+r-k}) = \int_{x_{j+r-k}}^{x_{j+r}} f(\xi,u(\xi))\,d\xi$$

with $1 \le k \le r$ by an interpolatory quadrature, i.e., by

$$\int_{x_{j+r-k}}^{x_{j+r}} f(\xi,u(\xi))\,d\xi \approx \int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\,d\xi,$$

where $p \in P_s$ with $0 \le s \le r$ is the uniquely determined polynomial with the interpolation property

$$p(x_{j+m}) = f(x_{j+m},u_{j+m}), \quad m = 0, \ldots, s,$$

i.e., by setting

$$u_{j+r} - u_{j+r-k} = \int_{x_{j+r-k}}^{x_{j+r}} p(\xi)\,d\xi. \tag{10.21}$$

Integrating the Lagrange representation (8.2) of the interpolation polynomial shows that these multistep methods are of the form

$$u_{j+r} - u_{j+r-k} = h\sum_{m=0}^{s} b_m f(x_{j+m},u_{j+m})$$

with coefficients $b_0, \ldots, b_s$ depending on $r$, $k$, and $s$.

From (10.21) we can generate a variety of methods by choosing the num- 
ber of steps r, the number s + 1 of interpolation points, and the number 
k of integration intervals appropriately. We briefly report on some of these 
methods. 

The Adams-Bashforth method, introduced by Adams and Bashforth in 1883, is obtained by taking $k = 1$ and $s = r - 1$. For $r = 1$ the interpolation polynomial is a constant, and therefore

$$u_{j+1} = u_j + hf(x_j,u_j). \tag{10.22}$$

For $r = 2$ the interpolation is linear and leads to

$$u_{j+2} = u_{j+1} + \frac{h}{2}\,[3f(x_{j+1},u_{j+1}) - f(x_j,u_j)] \tag{10.23}$$

(see Problem 10.12). The Adams-Bashforth method is explicit. Clearly, (10.22) coincides with the Euler method from Definition 10.8.
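The two-step formula (10.23) can be sketched as follows, with the missing starting value $u_1$ taken from one Runge-Kutta step. The function names and the test problem $u' = -u$, $u(0) = 1$ are assumptions introduced for this illustration.

```python
import math

def rk4_step(f, x, u, h):
    """One classical Runge-Kutta step, used here only for the starting value."""
    k1 = f(x, u)
    k2 = f(x + h / 2, u + h / 2 * k1)
    k3 = f(x + h / 2, u + h / 2 * k2)
    k4 = f(x + h, u + h * k3)
    return u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def adams_bashforth2(f, x0, u0, h, n):
    """Two-step Adams-Bashforth method (10.23); returns [u_0, ..., u_n] for n >= 1."""
    us = [u0, rk4_step(f, x0, u0, h)]          # starting values u_0 and u_1
    f_prev = f(x0, u0)
    for j in range(n - 1):
        x_cur = x0 + (j + 1) * h
        f_cur = f(x_cur, us[-1])               # the only new evaluation of f in this step
        us.append(us[-1] + 0.5 * h * (3.0 * f_cur - f_prev))
        f_prev = f_cur                         # reuse the value in the next step
    return us

# Observed order on the hypothetical test problem u' = -u, u(0) = 1 at x = 1.
def err(n):
    approx = adams_bashforth2(lambda x, u: -u, 0.0, 1.0, 1.0 / n, n)[-1]
    return abs(approx - math.exp(-1.0))

order = math.log2(err(100) / err(200))         # close to 2
```

Each step reuses the function value from the previous step, so only one new evaluation of $f$ is needed per step.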
The Adams-Moulton method, devised by Moulton during World War I, is given by $k = 1$ and $s = r$. For $r = 1$ the interpolation is linear, whence

$$u_{j+1} = u_j + \frac{h}{2}\,[f(x_{j+1},u_{j+1}) + f(x_j,u_j)]. \tag{10.24}$$

For $r = 2$ the interpolation is quadratic, leading to

$$u_{j+2} = u_{j+1} + \frac{h}{12}\,[5f(x_{j+2},u_{j+2}) + 8f(x_{j+1},u_{j+1}) - f(x_j,u_j)] \tag{10.25}$$

(see Problem 10.12). The Adams-Moulton method is implicit. One iteration step for the solution of the nonlinear equation for $u_{j+r}$, starting with the approximation given by the corresponding Adams-Bashforth method, leads to a predictor-corrector method. Clearly, (10.24) coincides with the implicit Euler method from Definition 10.10.
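One way to realize such a predictor-corrector combination for $r = 2$ is sketched below: the Adams-Bashforth formula (10.23) supplies the predictor, and a single iteration of the Adams-Moulton formula (10.25) supplies the corrector. The function name `adams_pc`, the Runge-Kutta starting step, and the test problem $u' = -u$, $u(0) = 1$ are assumptions made for this example.

```python
import math

def rk4_step(f, x, u, h):
    """One classical Runge-Kutta step, used here only for the starting value."""
    k1 = f(x, u)
    k2 = f(x + h / 2, u + h / 2 * k1)
    k3 = f(x + h / 2, u + h / 2 * k2)
    k4 = f(x + h, u + h * k3)
    return u + h / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

def adams_pc(f, x0, u0, h, n):
    """Predictor (10.23) followed by one corrector iteration with (10.25)."""
    us = [u0, rk4_step(f, x0, u0, h)]
    fs = [f(x0, u0), f(x0 + h, us[1])]
    for j in range(n - 1):
        x_new = x0 + (j + 2) * h
        # Predictor: explicit Adams-Bashforth value for u_{j+2}.
        pred = us[-1] + 0.5 * h * (3.0 * fs[-1] - fs[-2])
        # Corrector: one fixed-point iteration of the implicit Adams-Moulton formula.
        corr = us[-1] + h / 12.0 * (5.0 * f(x_new, pred) + 8.0 * fs[-1] - fs[-2])
        us.append(corr)
        fs.append(f(x_new, corr))
    return us

# Hypothetical test problem u' = -u, u(0) = 1, evaluated at x = 1 with h = 0.01.
approx = adams_pc(lambda x, u: -u, 0.0, 1.0, 0.01, 100)[-1]
error = abs(approx - math.exp(-1.0))
```

In this variant $f$ is evaluated twice per step, once at the predicted and once at the corrected value.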

The explicit method for $k = 2$ and $s = r - 1$ is known as the Nyström method, and the implicit method for $k = 2$ and $s = r$ is called the Milne-Thomson method (see Problem 10.14).


Definition 10.28 For each $(x,u) \in G$ denote by $\eta = \eta(\xi)$ the unique solution to the initial value problem

$$\eta' = f(\xi,\eta), \quad \eta(x) = u,$$

for the initial data $(x,u)$. Then

$$\Delta(x,u;h) := \frac{1}{h}\left[\eta(x+rh) + \sum_{m=0}^{r-1}a_m\eta(x+mh)\right] - \varphi(x,\eta(x),\ldots,\eta(x+(r-1)h);h)$$

is called the local discretization error. The multistep method is called consistent (with the initial value problem) if

$$\lim_{h\to 0}\Delta(x,u;h) = 0$$

uniformly for all $(x,u) \in G$, and it is said to have consistency order $p$ if

$$|\Delta(x,u;h)| \le Kh^p$$

for all $(x,u) \in G$, all $h > 0$, and some constant $K$.


Theorem 10.29 If $f$ is $(s+1)$-times continuously differentiable, then the multistep methods (10.21) are consistent of order $s + 1$.

Proof. By construction we have that

$$\Delta(x,u;h) = \frac{1}{h}\int_{x+(r-k)h}^{x+rh}[f(\xi,\eta(\xi)) - p(\xi)]\,d\xi,$$

where $p$ denotes the polynomial satisfying the interpolation condition

$$p(x+mh) = f(x+mh,\eta(x+mh)), \quad m = 0, \ldots, s.$$

By Theorem 8.10 on the remainder in polynomial interpolation, we can estimate

$$|f(\xi,\eta(\xi)) - p(\xi)| \le Kh^{s+1}$$

for all $\xi$ in the interval $x + (r-k)h \le \xi \le x + rh$ and some constant $K$ depending on $f$ and its derivatives up to order $s + 1$. □


Analyzing the convergence for multistep methods is more involved than for single-step methods for the following two reasons. Firstly, the approximation obtained by a multistep method is, of course, also influenced by the errors

$$e_j := u_j - u(x_j), \quad j = 0, \ldots, r-1,$$

in the starting values. Hence we give the following definition.

Definition 10.30 The starting values $u_j$, $j = 0, \ldots, r-1$, are called consistent if

$$\lim_{h\to 0}[u_j(h) - u(x_j)] = 0, \quad j = 0, \ldots, r-1.$$

They are said to have consistency order $p$ if

$$|u_j(h) - u(x_j)| \le Kh^p, \quad j = 0, \ldots, r-1,$$

for all $h > 0$ and some constant $K$.


To make sure that the consistency order of the starting values coincides 
with the consistency order of the multistep method, the single-step method 
for computing the starting values has to be chosen accordingly. 

Secondly, multistep methods can be unstable, as illustrated by the fol- 
lowing example. 


Example 10.31 Let $p$ be the quadratic interpolation polynomial satisfying

$$p(x_j) = u(x_j), \quad j = 0, 1, 2,$$

and approximate

$$u'(x_0) \approx p'(x_0).$$

Using the fact that the approximation for the derivative is exact for polynomials of degree less than or equal to two, simple calculations show that (see Problem 10.15)

$$p'(x_0) = \frac{1}{2h}\,[-u(x_2) + 4u(x_1) - 3u(x_0)]. \tag{10.26}$$

If $u$ is three times continuously differentiable, by Theorem 8.10 we have

$$\left|\frac{u(x) - p(x)}{x - x_0}\right| \le \frac{1}{6}\,\|u'''\|_\infty\,|(x - x_1)(x - x_2)|,$$

and from this, passing to the limit $x \to x_0$, it follows that the error for the derivative can be estimated by

$$|u'(x_0) - p'(x_0)| \le \frac{h^2}{3}\,\|u'''\|_\infty. \tag{10.27}$$

By approximating

$$p'(x_0) \approx u'(x_0) = f(x_0,u_0)$$

we derive a multistep method of the form

$$u_{j+2} - 4u_{j+1} + 3u_j = -2hf(x_j,u_j), \quad j = 0, 1, \ldots. \tag{10.28}$$

From (10.27) it follows that (10.28) is consistent with order two if $f$ is twice continuously differentiable.

Now we consider the initial value problem

$$u' = -u, \quad u(0) = 1,$$

with the solution $u(x) = e^{-x}$. Here the multistep method (10.28) reads

$$u_{j+2} - 4u_{j+1} + (3 - 2h)u_j = 0, \quad j = 0, 1, \ldots. \tag{10.29}$$

Table 10.3 gives the error $e_j = u_j - e^{-x_j}$ between the approximate and exact solutions for the step sizes $h = 0.1$ and $h = 0.01$. For the starting values, $u_0 = 1$ and $u_1 = e^{-h}$ have been used with ten-decimal-digits accuracy. The last column gives the quotient $q_j := e_j/e_{j-1}$ of the error in two consecutive steps.


TABLE 10.3. Numerical results for Example 10.31 


0.0109 | 3. 0.0000 


0.1123 | 3. 0.0099 
1.0858 | 3. 2.4456 
10.4143 | 3. 604.1985 


In order to explain the numerical failure indicated by the results in Table 10.3, we solve the difference equation (10.29) by looking for solutions of the form

$$u_j = \alpha\lambda^j, \tag{10.30}$$

where $\alpha$ and $\lambda$ are complex numbers. Substituting into (10.29) shows that (10.30) solves (10.29) if and only if $\lambda$ is a solution of the so-called characteristic equation

$$\lambda^2 - 4\lambda + (3 - 2h) = 0.$$

This quadratic equation has two solutions, namely

$$\lambda_{1,2} = 2 \mp \sqrt{1 + 2h}.$$

Therefore, the general solution of (10.29) is given by

$$u_j = \alpha\lambda_1^j + \beta\lambda_2^j.$$

The two constants $\alpha$ and $\beta$ are determined by the conditions $u_0 = 1$ and $u_1 = e^{-h}$ and have the values

$$\alpha = \frac{\lambda_2 - e^{-h}}{\lambda_2 - \lambda_1} = 1 + O(h^3)$$

and

$$\beta = \frac{e^{-h} - \lambda_1}{\lambda_2 - \lambda_1} = O(h^3).$$

The term $\alpha\lambda_1^j$ in the solution to the difference equation approximates the solution $e^{-x_j} = e^{-jh}$ to the initial value problem, since

$$\alpha\lambda_1^j = [1 + O(h^3)]\,[1 - h + O(h^2)]^j \approx e^{-jh}.$$

However, the additional term $\beta\lambda_2^j$ grows exponentially, and the relation

$$q_j = \frac{u_j - u(x_j)}{u_{j-1} - u(x_{j-1})} \approx \lambda_2 = 3 + h + O(h^2)$$

explains the last column of Table 10.3. □
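The behavior described in Example 10.31 can be reproduced with a few lines of code. The sketch below iterates the recursion (10.29) in double precision (rather than with the ten-digit starting values of the example) and compares the error quotients with the larger root of the characteristic equation; the number of steps is an arbitrary choice.

```python
import math

h = 0.1
u = [1.0, math.exp(-h)]                      # starting values u_0 = 1 and u_1 = e^{-h}
for j in range(20):
    # Recursion (10.29) solved for u_{j+2}.
    u.append(4.0 * u[-1] - (3.0 - 2.0 * h) * u[-2])

# Errors e_j = u_j - e^{-x_j} with x_j = j*h.
errors = [u[j] - math.exp(-j * h) for j in range(len(u))]

# Quotients q_j = e_j / e_{j-1}; e_0 and e_1 vanish, so start at j = 3.
quotients = [errors[j] / errors[j - 1] for j in range(3, len(u))]

lambda2 = 2.0 + math.sqrt(1.0 + 2.0 * h)     # larger root of the characteristic equation
```

After a few steps the quotients settle at the larger root, which is approximately $3 + h$, and the error grows by about this factor in every step.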


Roughly speaking, for multistep methods with $r \ge 2$, the (homogeneous) difference equation of order $r$ occurring in the multistep method has $r$ linearly independent solutions, whereas the approximated differential equation has only one solution. Hence only one of the solutions to the difference equation corresponds to the differential equation. Therefore, convergence of the multistep method can be expected only when the additional solutions to the difference equation remain bounded. Note that these additional solutions will always be activated by errors in the starting values and by round-off errors. For this reason we proceed by investigating the stability of the difference equation.

Definition 10.32 The linear difference equation

$$u_{j+r} + \sum_{m=0}^{r-1}a_mu_{j+m} = 0, \quad j = 0, 1, \ldots, \tag{10.31}$$

with constant coefficients $a_0, \ldots, a_{r-1}$ is called stable if all its solutions are bounded.
Theorem 10.33 The linear difference equation (10.31) is stable if and only if it satisfies the root condition, i.e., if all the zeros $\lambda$ of the characteristic polynomial

$$p(\lambda) = \lambda^r + \sum_{m=0}^{r-1}a_m\lambda^m \tag{10.32}$$

have absolute value $|\lambda| \le 1$, and zeros satisfying $|\lambda| = 1$ are simple zeros.


Proof. We begin by noting that each solution to the difference equation (10.31) is uniquely determined by its $r$ initial values $u_0, u_1, \ldots, u_{r-1}$. Obviously, from these initial values the remaining terms $u_r, u_{r+1}, \ldots$ are recursively determined by (10.31).

For convenience we set $a_r = 1$ and denote by $\Lambda$ the differential operator given by $(\Lambda f)(\lambda) = \lambda f'(\lambda)$. Then for the sequence

$$u_j = j^n\lambda^j, \quad j = 0, 1, \ldots, \tag{10.33}$$

we have that

$$u_{j+r} + \sum_{m=0}^{r-1}a_mu_{j+m} = \sum_{m=0}^{r}a_m(j+m)^n\lambda^{j+m} = \lambda^j\sum_{k=0}^{n}\binom{n}{k}j^k\sum_{m=0}^{r}a_mm^{n-k}\lambda^m.$$

Here the inner sum coincides with $(\Lambda^{n-k}p)(\lambda)$, since $\Lambda\lambda^m = m\lambda^m$. From this it can be deduced that if $\lambda$ is a zero of the characteristic polynomial $p$ of multiplicity $s$, then for $n = 0, 1, \ldots, s-1$ the sequence (10.33) solves the difference equation.

Now assume that $\lambda_1, \ldots, \lambda_k$ are the zeros of the characteristic polynomial (10.32) and have multiplicities $s_1, \ldots, s_k$; i.e.,

$$p(\lambda) = \prod_{l=1}^{k}(\lambda - \lambda_l)^{s_l}.$$

Then the general solution of the homogeneous difference equation (10.31) is given by

$$u_j = \sum_{l=1}^{k}\sum_{s=0}^{s_l-1}\alpha_{ls}\,j^s\lambda_l^j \tag{10.34}$$

with $r$ arbitrary constants $\alpha_{ls}$. To establish this we need to show that the coefficients $\alpha_{ls}$ can be chosen such that arbitrarily given initial conditions

$$\sum_{l=1}^{k}\sum_{s=0}^{s_l-1}\alpha_{ls}\,j^s\lambda_l^j = u_j, \quad j = 0, \ldots, r-1, \tag{10.35}$$

are fulfilled. The homogeneous adjoint system to the system (10.35) reads

$$\sum_{j=0}^{r-1}\beta_j\,j^s\lambda_l^j = 0, \quad s = 0, \ldots, s_l - 1, \quad l = 1, \ldots, k.$$

Assume that $\beta_j$, $j = 0, \ldots, r-1$, is a solution. Then the polynomial

$$q(\lambda) = \sum_{j=0}^{r-1}\beta_j\lambda^j$$

of degree $r - 1$ has the zeros $\lambda_l$ with multiplicity $s_l$ for $l = 1, \ldots, k$; i.e., the polynomial has $r$ zeros and therefore, by Theorem 8.1, must vanish identically. This implies $\beta_0 = \cdots = \beta_{r-1} = 0$. Hence, for each given right-hand side the system (10.35) has a unique solution.

Now from the form (10.34) of the general solution to the difference equation, the equivalence of stability and the root condition is obvious. □
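For a difference equation of order two the root condition of Theorem 10.33 can be checked directly from the quadratic formula. The sketch below is restricted to $r = 2$ and uses a crude floating-point tolerance to decide whether a zero lies on the unit circle or is a double zero; the function name and the tolerance are assumptions for illustration.

```python
import cmath

def satisfies_root_condition(a1, a0, tol=1e-12):
    """Root condition for lambda^2 + a1*lambda + a0 (case r = 2 only)."""
    disc = cmath.sqrt(a1 * a1 - 4.0 * a0)
    roots = [(-a1 + disc) / 2.0, (-a1 - disc) / 2.0]
    if any(abs(z) > 1.0 + tol for z in roots):
        return False                           # a zero outside the closed unit disk
    double_root = abs(roots[0] - roots[1]) <= tol
    on_circle = abs(abs(roots[0]) - 1.0) <= tol
    return not (double_root and on_circle)     # zeros with |lambda| = 1 must be simple

# lambda^2 - lambda: left-hand side of the Adams-Bashforth formula (10.23).
stable_ab2 = satisfies_root_condition(-1.0, 0.0)
# lambda^2 - 4*lambda + 3: left-hand side of the difference formula (10.28).
stable_1028 = satisfies_root_condition(-4.0, 3.0)
# (lambda - 1)^2: hypothetical example with a double zero on the unit circle.
stable_double = satisfies_root_condition(-2.0, 1.0)
```

The second polynomial has the zeros 1 and 3, which is consistent with the exponential error growth observed in Example 10.31.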


Besides the solution (10.34) of the homogeneous difference equation, we 
also will need an explicit expression for the solution to the inhomogeneous 
difference equation. 


Lemma 10.34 For $k = 0, 1, \ldots, r-1$, let $u_{j,k}$ denote the unique solutions to the homogeneous difference equation (10.31) with initial values

$$u_{j,k} = \delta_{jk}, \quad j = 0, 1, \ldots, r-1.$$

Then for a given right-hand side $c_r, c_{r+1}, \ldots$, the unique solution to the inhomogeneous difference equation

$$z_{j+r} + \sum_{m=0}^{r-1}a_mz_{j+m} = c_{j+r}, \quad j = 0, 1, \ldots, \tag{10.36}$$

with initial values $z_0, z_1, \ldots, z_{r-1}$ is given by

$$z_{j+r} = \sum_{k=0}^{r-1}z_ku_{j+r,k} + \sum_{k=0}^{j}c_{k+r}u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots. \tag{10.37}$$

Proof. Setting $u_{m,r-1} = 0$ for $m = -1, -2, \ldots$, we can rewrite (10.37) in the form

$$z_j = \sum_{k=0}^{r-1}z_ku_{j,k} + w_j, \quad j = 0, 1, \ldots,$$

where

$$w_j := \sum_{k=0}^{\infty}c_{k+r}u_{j-k-1,r-1}, \quad j = 0, 1, \ldots.$$
Obviously, $w_j = 0$ for $j = 0, \ldots, r-1$, and therefore it remains to show that $w_j$ satisfies the inhomogeneous difference equation (10.36).

As in the proof of Theorem 10.33 we set $a_r = 1$. Then, using $u_{m,r-1} = 0$ for $m < r-1$, $u_{r-1,r-1} = 1$, and the homogeneous difference equation for $u_{m,r-1}$, we compute

$$\sum_{m=0}^{r}a_mw_{j+m} = \sum_{m=0}^{r}a_m\sum_{k=0}^{\infty}c_{k+r}u_{j+m-k-1,r-1} = \sum_{m=0}^{r}a_m\sum_{k=0}^{j}c_{k+r}u_{j+m-k-1,r-1}$$

$$= \sum_{k=0}^{j}c_{k+r}\sum_{m=0}^{r}a_mu_{j+m-k-1,r-1} = c_{j+r}.$$

Now the proof is completed by noting that each solution to the inhomogeneous difference equation (10.36) is uniquely determined by its $r$ initial values $z_0, z_1, \ldots, z_{r-1}$. □


Definition 10.35 The multistep method of Definition 10.27 is called stable if the associated difference equation

$$u_{j+r} + \sum_{m=0}^{r-1}a_mu_{j+m} = 0$$

is stable.

Single-step methods are always stable, since the associated difference equation $u_{j+1} - u_j = 0$ clearly satisfies the root condition.

Remark 10.36 The multistep methods (10.21) are stable.

Proof. The corresponding characteristic polynomial $p(\lambda) := \lambda^r - \lambda^{r-k}$ fulfills the root condition. □


For establishing convergence of multistep methods, we will need the fol- 
lowing extension of Lemma 10.21. 


Lemma 10.37 Let $(\xi_j)$ be a sequence in $\mathbb{R}$ with the property

$$|\xi_j| \le A\sum_{m=0}^{j-1}|\xi_m| + B, \quad j = 1, 2, \ldots,$$

for some constants $A > 0$ and $B \ge 0$. Then the estimate

$$|\xi_j| \le (A|\xi_0| + B)\,e^{(j-1)A}, \quad j = 1, 2, \ldots,$$

holds.

Proof. We prove by induction that

$$|\xi_j| \le (A|\xi_0| + B)(1 + A)^{j-1}, \quad j = 1, 2, \ldots. \tag{10.38}$$

Then the assertion follows by using the estimate $1 + A \le e^A$. The inequality (10.38) is true for $j = 1$. Assume that it has been proven up to some $j \ge 1$. Then we have

$$|\xi_{j+1}| \le A\sum_{m=0}^{j}|\xi_m| + B \le (A|\xi_0| + B) + A\sum_{m=1}^{j}(A|\xi_0| + B)(1 + A)^{m-1} = (A|\xi_0| + B)(1 + A)^j;$$

i.e., the estimate is also true for $j + 1$. □


Theorem 10.38 Assume that the function $\varphi$ describing the multistep method is continuous and satisfies a Lipschitz condition; i.e.,

$$|\varphi(x,u_0,u_1,\ldots,u_{r-1};h) - \varphi(x,v_0,v_1,\ldots,v_{r-1};h)| \le M\sum_{m=0}^{r-1}|u_m - v_m|$$

for all $(x,u_0), \ldots, (x,u_{r-1}), (x,v_0), \ldots, (x,v_{r-1}) \in G$, all (sufficiently small) $h$, and a Lipschitz constant $M$. Furthermore, assume that the multistep method is consistent and stable and that the starting values are consistent. Then the multistep method is convergent. If both the multistep method and the starting values have consistency order $p$, then the convergence also is of order $p$.


Proof. (Compare to the proof of Theorem 10.22.) For the errors

$$e_j = u_j - u(x_j)$$

we obtain

$$e_{j+r} + \sum_{m=0}^{r-1}a_me_{j+m} = u_{j+r} + \sum_{m=0}^{r-1}a_mu_{j+m} - u(x_{j+r}) - \sum_{m=0}^{r-1}a_mu(x_{j+m})$$

$$= h\varphi(x_j,u_j,\ldots,u_{j+r-1};h) - h\Delta(x_j,u(x_j);h) - h\varphi(x_j,u(x_j),\ldots,u(x_{j+r-1});h).$$

We rewrite this into the form

$$e_{j+r} + \sum_{m=0}^{r-1}a_me_{j+m} = hc_{j+r}, \quad j = 0, 1, \ldots, \tag{10.39}$$

where

$$c_{j+r} := \varphi(x_j,u_j,\ldots,u_{j+r-1};h) - \Delta(x_j,u(x_j);h) - \varphi(x_j,u(x_j),\ldots,u(x_{j+r-1});h).$$

We can estimate the right-hand side by

$$|c_{j+r}| \le M\sum_{m=0}^{r-1}|e_{j+m}| + c(h), \quad j = 0, 1, \ldots, \tag{10.40}$$

where

$$c(h) := \max_{a\le x\le b}|\Delta(x,u(x);h)|$$

satisfies $c(h) \to 0$, $h \to 0$, since we assume consistency. By Lemma 10.34 we can express the solution of (10.39) in the form

$$e_{j+r} = \sum_{k=0}^{r-1}e_ku_{j+r,k} + h\sum_{k=0}^{j}c_{k+r}u_{j+r-k-1,r-1}, \quad j = 0, 1, \ldots.$$


From this, since we assume stability, we can estimate

$$|e_{j+r}| \le N\Bigl\{d(h) + h\sum_{k=0}^{j}|c_{k+r}|\Bigr\}, \quad j = 0, 1, \ldots,$$

for some constant $N$ and

$$d(h) := \sum_{k=0}^{r-1}|e_k|.$$

We note that $d(h) \to 0$, $h \to 0$, since the starting values are assumed to be consistent. Inserting (10.40) into the last inequality now yields

$$|e_{j+r}| \le N\Bigl\{d(h) + hM\sum_{k=0}^{j}\sum_{m=0}^{r-1}|e_{k+m}| + (j+1)h\,c(h)\Bigr\}, \quad j = 0, 1, \ldots.$$

Because of

$$\sum_{k=0}^{j}\sum_{m=0}^{r-1}|e_{k+m}| = \sum_{m=0}^{r-1}\sum_{k=m}^{m+j}|e_k| \le r\sum_{k=0}^{r+j-1}|e_k| = r\sum_{k=r}^{r+j-1}|e_k| + r\,d(h)$$

and $(j+1)h \le x_{j+1} - x_0 \le 2(b-a)$ we obtain that $|e_r| \le C\gamma(h)$ and

$$|e_{j+r}| \le C\Bigl\{\gamma(h) + h\sum_{k=r}^{r+j-1}|e_k|\Bigr\}, \quad j = 1, 2, \ldots,$$
for some constant $C$ and $\gamma(h) := d(h) + c(h)$. Now Lemma 10.37 implies that

$$|e_{j+r}| \le C\{|e_r|h + \gamma(h)\}\,e^{(j-1)hC} \le C(1 + Ch)\gamma(h)\,e^{(b-a)C}, \quad j = 1, 2, \ldots,$$

whence

$$E(h) \le C(1 + Ch)\gamma(h)\,e^{(b-a)C} \to 0, \quad h \to 0,$$

follows, since $\gamma(h) \to 0$, $h \to 0$. For consistency order $p$ we have that $\gamma(h) = O(h^p)$; i.e., the convergence is also of order $p$. □


The basic advantage of multistep methods results from the fact that for arbitrary convergence order, in each step only one new evaluation of the function $f$ is required. In contrast, for single-step methods the number of function evaluations required in each step is equal, in general, to the convergence order. Therefore, for a given convergence order, multistep methods need fewer function evaluations per step and are correspondingly faster than single-step methods. However, it should be noted that readjusting the step size during the computation is more involved due to the need to recompute the corresponding starting values for the new step size.


Problems 
10.1 Find the exact solution of the initial value problem 
ui =—u’, u(0)=1, 


and compare it to the approximate solutions obtained by successive approximations according to Corollary 10.6. Compute the third iterate $u_3$ and compare the exact error $u - u_3$ to the a posteriori error estimate from Corollary 10.6.

10.2 Consider the initial value problem $u' = u$, $u(0) = 1$, and show that the approximate solution from the Euler method is given by $u_j = (1 + h)^j$.


10.3 Find the exact solution of the initial value problem 
' u 
u=2-, u(l)=1. 
=, u(1) 
Determine an analytic expression for the approximate solution by Euler’s method 


and verify the convergence order one predicted by Theorem 10.18. 


10.4 Show that Euler's method fails to approximate the solution $u(x) = \bigl(\frac{2}{3}x\bigr)^{3/2}$ of the initial value problem $u' = u^{1/3}$, $u(0) = 0$. Explain this failure.

10.5 Show that the differential equation $u' = ax$ with $a \in \mathbb{R}$ is solved exactly by the improved Euler method.

10.6 Show that the single-step method

$$u_{j+1} = u_j + hf\Bigl(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,f(x_j,u_j)\Bigr)$$

has consistency order two if $f$ is twice continuously differentiable. This method is known as the modified Euler method.


10.7 Show that the single-step method given by

$$k_1 = f(x_j,u_j),$$

$$k_2 = f\Bigl(x_j + \frac{h}{3}, u_j + \frac{h}{3}\,k_1\Bigr),$$

$$k_3 = f\Bigl(x_j + \frac{2h}{3}, u_j + \frac{2h}{3}\,k_2\Bigr),$$

and

$$u_{j+1} = u_j + \frac{h}{4}\,(k_1 + 3k_3)$$

is consistent and has consistency order three if $f$ is three-times continuously differentiable. This method is known as Heun's third-order method.

10.8 Show that the single-step method given by

$$k_1 = f(x_j,u_j),$$

$$k_2 = f\Bigl(x_j + \frac{h}{2}, u_j + \frac{h}{2}\,k_1\Bigr),$$

$$k_3 = f(x_j + h, u_j - hk_1 + 2hk_2),$$

and

$$u_{j+1} = u_j + \frac{h}{6}\,(k_1 + 4k_2 + k_3)$$

is consistent and has consistency order three if $f$ is three-times continuously differentiable. This method is known as Kutta's third-order method.


10.9 Show that the Runge-Kutta method (see Definition 10.25) has consistency 
order four if f is four-times continuously differentiable. 


10.10 Write a computer program for the Runge-Kutta method and test it for 
various examples. 


10.11 The populations $p = p(t)$ and $q = q(t)$ of two interacting animal species
that have a predator-prey relationship are modeled by the system of the
Lotka-Volterra equations

$$\frac{dp}{dt} = \alpha p + \beta p q, \quad \frac{dq}{dt} = \gamma q + \delta p q$$

with constant coefficients $\alpha < 0$, $\beta > 0$, $\gamma > 0$, and $\delta < 0$, complemented by initial
conditions $p(0) = p_0$ and $q(0) = q_0$. (Explain the significance of the signs of the
constants for the model.) For the coefficients $\alpha = -1$, $\beta = 0.01$, $\gamma = 0.25$, and
$\delta = -0.01$, test the stability of the solutions by solving the initial value problem
numerically by the Runge-Kutta method for the four different initial conditions
$p_0 = 30 \pm 1$ and $q_0 = 80 \pm 1$. Visualize the numerical results by a phase diagram,
i.e., by the curve $\{(p(t), q(t)) : t \in [0, T]\}$ for sufficiently large $T > 0$.


10.12 Verify the coefficients in the Adams—Bashforth and Adams—Moulton meth- 
ods (10.22)—(10.25). 


10.13 Determine the coefficients of the Adams—Bashforth and Adams—Moulton 
methods for r = 3. 


10.14 The multistep methods (10.21) for $k = 2$ and $s = r - 1$ and for $k = 2$
and $s = r$ are known as the Nyström method and the Milne-Thomson method,
respectively. Determine the coefficients of the Nyström method and the
Milne-Thomson method for $r = 1$ and $r = 2$.


10.15 Verify the coefficients in the difference formula (10.26).


10.16 Construct a two-step method of the form

$$u_{j+2} + a_1 u_{j+1} + a_0 u_j = h\,[b_0 f(x_j, u_j) + b_1 f(x_{j+1}, u_{j+1})]$$

that has consistency order two and discuss its stability.


10.17 Find the general solution of the difference equation

$$u_{j+2} - 2a u_{j+1} + a u_j = 1$$

for $0 < a < 1$. Show that $\lim_{j \to \infty} u_j = 1/(1 - a)$.


10.18 Find an explicit expression for the Fibonacci numbers $a_j$, which are de-
fined by $a_0 = a_1 = 1$ and $a_{j+1} = a_j + a_{j-1}$ for $j \ge 1$. Is the root condition of
Theorem 10.33 satisfied?


10.19 Attempt to approximate the unique solution u(x) = 2 of the initial value 
problem 
u=su(u-2), u(0)=2, 
numerically by any of the methods described in this chapter. Discuss the results 
by relating them to the solution of the initial value problem with perturbed initial 
condition u(0) = 2+ a for small a € R. 


10.20 Consider the approximate solution of the initial value problem 
u' +100u= 100, u(0) =2, 


by the Euler method. Explain why for an accurate approximation the step size
$h$ has to be chosen smaller than 0.02 despite the fact that the solution is
almost constant for $x$ not too small, say, for $x > 0.1$. (This differential equation
is an example of a so-called stiff equation, for which the numerical solution is
rather delicate.)


11 


Boundary Value Problems 


Whereas in initial value problems the solution is determined by conditions 
imposed at one point only, boundary value problems for ordinary differ- 
ential equations are problems in which the solution is required to satisfy 
conditions at more than one point, usually at the two endpoints of the 
interval in which the solution is to be found. Since an ordinary differential 
equation of order n has, in principle, a general solution depending on n pa- 
rameters, the total number of boundary conditions required to determine 
a unique solution is n. For an introduction to some of the basic methods 
for the numerical solution of such boundary value problems we shall con- 
fine ourselves to the simplest boundary value problem, which is one for 
an equation of the second order in which the solution is specified at two 
distinct points. For more detailed studies we refer to [13, 36, 46]. 

As opposed to the fundamental Picard-Lindelöf existence and uniqueness
theorem for initial value problems, a detailed analysis of the existence and 
uniqueness theory for nonlinear boundary value problems is more involved 
and beyond the scope of this introduction. However, for linear boundary 
value problems the theory is more elementary, and we shall include part of 
it in our analysis. 

For the numerical solution of boundary value problems for ordinary dif- 
ferential equations three different groups of methods are available: shooting 
methods, finite difference methods, and finite element methods. Whereas 
shooting methods, which we briefly describe in Section 11.1 and which rely 
on numerical methods for initial value problems, are restricted to ordinary 
differential equations, the finite difference and finite element methods can 
also be applied to boundary value problems for partial differential equa-
tions. Therefore, our presentation of finite difference and finite element
methods for linear ordinary differential equations is also meant as a model 
discussion for the more complicated and more important case of partial 
differential equations. 

Of course, in one chapter only a small part of the theory and the ap- 
plications of finite difference and finite element methods can be covered. 
Hence, we set ourselves the task to outline the basic ideas of these meth- 
ods by considering only the simplest cases. For a solid foundation of the 
finite element method, we felt it was necessary to include as its theoretical 
basis a discussion of the Galerkin method for strictly coercive operators, 
which appears in Section 11.3. This, in turn, made it necessary to present 
the Lax—Milgram theorem on the existence of solutions for equations with 
strictly coercive operators. 


11.1 Shooting Methods 


Consider the boundary value problem for the differential equation of the
second order

$$u'' = f(x, u, u'), \quad a \le x \le b, \tag{11.1}$$

with boundary conditions

$$u(a) = \alpha, \quad u(b) = \beta. \tag{11.2}$$

For the sake of simplicity we assume that the function $f$ is defined on
$[a,b] \times \mathbb{R}^2$.

Shooting methods attempt to employ the numerical methods described 
in the previous chapter for initial value problems where, roughly speaking, 
the initial conditions at x = a are adjusted so that the solution satisfies the 
required boundary conditions (11.2). For this, in addition to the boundary 
value problem, we also consider the initial value problem 


$$u'' = f(x, u, u'), \quad u(a) = \alpha, \quad u'(a) = s, \tag{11.3}$$


with a real parameter s. Geometrically speaking, the parameter s prescribes 
the initial slope of the solution curve. 

If we assume that $f$ is continuous and satisfies a Lipschitz condition
with respect to $u$ and $u'$, then by the Picard-Lindelöf Theorem 10.5, for
each $s \in \mathbb{R}$ there exists a unique solution $u(\cdot, s)$ of the initial value problem
(11.3). To arrive at a solution to the boundary value problem (11.1)-(11.2),
the parameter $s$ has to be chosen such that $u(b, s) = \beta$; i.e., we have to
solve the equation

$$F(s) = 0,$$

where the function $F : \mathbb{R} \to \mathbb{R}$ is defined by

$$F(s) := u(b, s) - \beta.$$

For each $s$ the value $F(s)$ can be computed approximately by one of the
numerical methods of Chapter 10 for the solution of initial value problems,
extended appropriately to the case of a second-order equation. Note that
for a nonlinear differential equation the equation $F(s) = 0$ is nonlinear.
For finding a zero of $F$ the Newton method of Section 6.2 can be employed.
For the computation of the derivative $F'(s)$, which is required for
Newton's method, we assume that the solution $u$ to the initial value problem
(11.3) depends in a continuously differentiable manner on the parameter $s$.
This can be assured by appropriate assumptions on $f$ (see [12]). We set

$$v := \frac{\partial u}{\partial s}$$

and differentiate the differential equation and the initial condition (11.3)
with respect to $s$ to obtain

$$v''(x, s) = f_u(x, u(x, s), u'(x, s))\, v(x, s) + f_{u'}(x, u(x, s), u'(x, s))\, v'(x, s) \tag{11.4}$$

and

$$v(a, s) = 0, \quad v'(a, s) = 1. \tag{11.5}$$

Since

$$F'(s) = v(b, s),$$

computing the derivative of $F$ requires solving the additional linear initial
value problem (11.4)-(11.5) for $v$, where $u$ is known from solving (11.3).
Note that from a numerical approximation, u is known only at grid points. 
Summarizing, we obtain the following method. 


Algorithm 11.1 The shooting method with Newton iterations consists of
the following steps:

1. Choose an initial slope $s \in \mathbb{R}$.

2. Solve numerically the initial value problem for

$$u'' = f(x, u, u')$$

with initial conditions $u(a) = \alpha$, $u'(a) = s$ and the initial value problem for

$$v'' = f_u(x, u, u')\, v + f_{u'}(x, u, u')\, v'$$

with initial conditions $v(a) = 0$, $v'(a) = 1$.

3. If $u(b) = \beta$ is satisfied within the required accuracy, then stop; otherwise,
replace $s$ by

$$s - \frac{u(b) - \beta}{v(b)}$$

and go back to step 2.

Example 11.2 Consider the boundary value problem

$$u'' = u^3, \quad u(1) = \sqrt{2}, \quad u(2) = \tfrac{1}{2}\sqrt{2},$$

with the exact solution $u(x) = \sqrt{2}/x$. We solve numerically the associated
initial value problem

$$u'' = u^3, \quad u(1) = \sqrt{2}, \quad u'(1) = s,$$

by the improved Euler method of Section 10.2 with step sizes $h = 0.1$,
$h = 0.01$, and $h = 0.001$. For this we transform the initial value problem
for the equation of second order into the initial value problem for the system

$$u' = w, \quad w' = u^3, \quad u(1) = \sqrt{2}, \quad w(1) = s.$$

As starting value for the Newton iteration we choose $s = 0$. The exact initial
condition is $s = -\sqrt{2} = -1.414214$. The numerical results represented in
Table 11.1 illustrate the feasibility of the shooting method with Newton
iterations. □


TABLE 11.1. Numerical results for Example 11.2.

       h = 0.1         |       h = 0.01        |       h = 0.001
     s      |   F(s)   |     s      |   F(s)   |     s      |   F(s)
  0.00000   | 3.61648  |  0.00000   | 3.84079  |  0.00000   | 3.84400
 −0.81116   | 1.10056  | −0.74681   | 1.26284  | −0.74584   | 1.26538
 −1.31684   | 0.15879  | −1.28234   | 0.21124  | −1.28180   | 0.21210
 −1.41553   | 0.00373  | −1.40987   | 0.00678  | −1.40980   | 0.00684
 −1.41796   | 0.00000  | −1.41424   | 0.00000  | −1.41420   | 0.00000
 −1.41796   | 0.00000  | −1.41424   | 0.00000  | −1.41421   | 0.00000
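
The steps of Algorithm 11.1 can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: it takes the problem of Example 11.2 to be $u'' = u^3$, $u(1) = \sqrt{2}$, $u(2) = \sqrt{2}/2$, it reads "improved Euler method" as the predictor-corrector (Heun) form, and the function names `improved_euler`, `shoot`, and `newton_shooting` are hypothetical. The state vector collects $(u, u', v, v')$, so that $F(s)$ and $F'(s) = v(b, s)$ are obtained from one integration.

```python
import math

def improved_euler(rhs, y, x, h, steps):
    """Integrate the system y' = rhs(x, y) with the improved Euler (Heun) method."""
    for _ in range(steps):
        k1 = rhs(x, y)
        y_pred = [yi + h * ki for yi, ki in zip(y, k1)]
        k2 = rhs(x + h, y_pred)
        y = [yi + 0.5 * h * (p + q) for yi, p, q in zip(y, k1, k2)]
        x += h
    return y

def rhs(x, y):
    # State y = (u, u', v, v') with u'' = u^3 and, from (11.4), v'' = 3 u^2 v
    u, du, v, dv = y
    return [du, u**3, dv, 3.0 * u**2 * v]

def shoot(s, steps):
    """Return F(s) = u(b, s) - beta and F'(s) = v(b, s) for the assumed example."""
    a, b = 1.0, 2.0
    alpha, beta = math.sqrt(2.0), math.sqrt(2.0) / 2.0
    h = (b - a) / steps
    u, _, v, _ = improved_euler(rhs, [alpha, s, 0.0, 1.0], a, h, steps)
    return u - beta, v

def newton_shooting(s=0.0, steps=100, tol=1e-10, max_iter=20):
    """Newton iteration on the initial slope s (steps 1 to 3 of Algorithm 11.1)."""
    for _ in range(max_iter):
        F, dF = shoot(s, steps)
        if abs(F) < tol:
            break
        s -= F / dF
    return s

s_final = newton_shooting()          # step size h = 0.01
print(s_final, -math.sqrt(2.0))      # computed slope versus exact slope
```

The computed slope carries the discretization error of the initial value solver, so it agrees with $-\sqrt{2}$ only up to that error; decreasing the step size (larger `steps`) reduces the gap.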


Numerical problems with ill-conditioning will arise in cases where small 
changes in the initial data s will cause large changes in the solution u(., s). 
This is illustrated by the following example. 


Example 11.3 The linear boundary value problem

$$u'' - u' - 110u = 0, \quad u(0) = u(10) = 1,$$

has the unique solution

$$u(x) = \frac{1}{e^{110} - e^{-100}} \left\{ (e^{110} - 1)\, e^{-10x} + (1 - e^{-100})\, e^{11x} \right\}.$$

The unique solution to the associated initial value problem with initial
conditions $u(0) = 1$ and $u'(0) = s$ is given by

$$u(x, s) = \frac{11 - s}{21}\, e^{-10x} + \frac{10 + s}{21}\, e^{11x}.$$

Hence, in this case we have

$$F(s) = \frac{11 - s}{21}\, e^{-100} + \frac{10 + s}{21}\, e^{110} - 1.$$

From $F(s) = 0$ we deduce that the exact initial slope $s$ satisfies

$$-10 < s = -10 + 21\, \frac{e^{-110} - e^{-210}}{1 - e^{-210}}.$$

In a numerical computation with ten-decimal-digit accuracy the best ap-
proximation $\tilde{s}$ to the exact zero $s$ we can expect is such that

$$-10 \le \tilde{s} \le -10 + 10^{-9}.$$

Within this interval of initial conditions we now have

$$u(10, -10) = e^{-100} \approx 0$$

and

$$u(10, -10 + 10^{-9}) \approx \frac{10^{-9}}{21}\, e^{110} \approx 2.8 \cdot 10^{37};$$

i.e., small changes in $s$ will cause very large changes in the values of the
solution at the other endpoint. Hence, we cannot expect that this bound-
ary value problem can be numerically solved by the simple version of the
shooting method. □
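
The sensitivity described in Example 11.3 can be checked directly. The following sketch evaluates the closed-form solution $u(x, s) = \frac{11-s}{21} e^{-10x} + \frac{10+s}{21} e^{11x}$ of the initial value problem $u'' - u' - 110u = 0$, $u(0) = 1$, $u'(0) = s$ at $x = 10$ for the two slopes $s = -10$ and $s = -10 + 10^{-9}$; the function name `u_end` is hypothetical.

```python
import math

def u_end(s, x=10.0):
    """Value at x of the solution of u'' - u' - 110 u = 0, u(0) = 1, u'(0) = s."""
    return ((11.0 - s) / 21.0 * math.exp(-10.0 * x)
            + (10.0 + s) / 21.0 * math.exp(11.0 * x))

value_left = u_end(-10.0)           # equals e^{-100}, practically zero
value_right = u_end(-10.0 + 1e-9)   # the growing mode e^{11x} dominates

print(value_left)
print(value_right)
```

A perturbation of size $10^{-9}$ in the slope is amplified by the factor $e^{110}/21$, which is why a single shooting interval of length 10 is hopeless in ten-digit arithmetic.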


This difficulty can be remedied by a multiple shooting method as follows.
The interval $[a, b]$ is subdivided into $n$ subintervals according to

$$a = x_0 < x_1 < \cdots < x_{n-1} < x_n = b.$$

Then for given vectors $u = (u_0, \ldots, u_{n-1})^T$ and $s = (s_0, \ldots, s_{n-1})^T$ in $\mathbb{R}^n$
such that $u_0 = \alpha$, for $j = 0, \ldots, n-1$ consider the $n$ initial value problems
for

$$u'' = f(x, u, u')$$

on the subintervals $[x_j, x_{j+1}]$ with initial conditions

$$u(x_j) = u_j, \quad u'(x_j) = s_j.$$

In order to obtain from this a solution to the differential equation on all
of the interval $[a, b]$, the solutions $u(\cdot, u_j, s_j)$ on the subintervals $[x_j, x_{j+1}]$
have to coincide at the grid points $x_1, \ldots, x_{n-1}$ together with their first
derivatives. Then the differential equation ensures that the function is twice
continuously differentiable on $[a, b]$. In addition, the boundary condition
$u(b) = \beta$ must be satisfied. Altogether we have the following $2n-1$ nonlinear
equations for the $2n-1$ unknowns $u_1, \ldots, u_{n-1}$ and $s_0, \ldots, s_{n-1}$:

$$\begin{aligned}
u(x_{j+1}, u_j, s_j) - u_{j+1} &= 0, & j &= 0, \ldots, n-2, \\
u'(x_{j+1}, u_j, s_j) - s_{j+1} &= 0, & j &= 0, \ldots, n-2, \\
u(x_n, u_{n-1}, s_{n-1}) - \beta &= 0.
\end{aligned} \tag{11.6}$$

For the solution of this system Newton's method can again be used. For
details we refer to [36, 50].


11.2 Finite Difference Methods 


As already indicated in Example 2.1, the basic idea of finite difference 
methods for the approximate solution of boundary value problems consists 
in replacing the derivatives in the differential equations by difference quo- 
tients. For the sake of simplicity, we confine our presentation to a linear 
boundary value problem. Without loss of generality we need consider only 
the homogeneous boundary condition, since inhomogeneous boundary con- 
ditions can be dealt with by incorporating them into the right-hand side of 
the differential equation (see Problem 11.3). 


Theorem 11.4 Assume that $q, r \in C[a,b]$ and $q \ge 0$. Then the boundary
value problem for the linear differential equation

$$-u'' + qu = r \quad \text{on } [a,b] \tag{11.7}$$

with homogeneous boundary conditions

$$u(a) = u(b) = 0 \tag{11.8}$$

has a unique solution $u \in C^2[a, b]$.


Proof. Assume that $u_1$ and $u_2$ are two solutions to the boundary value
problem. Then the difference $u = u_1 - u_2$ solves the homogeneous boundary
value problem

$$-u'' + qu = 0, \quad u(a) = u(b) = 0.$$

By partial integration we obtain

$$\int_a^b \left( [u']^2 + q u^2 \right) dx = \int_a^b (-u'' + qu)\, u \, dx = 0.$$

This implies $u' = 0$ on $[a,b]$, since $q \ge 0$. Hence $u$ is constant on $[a, b]$,
and the boundary conditions finally yield $u = 0$ on $[a,b]$. Therefore, the
boundary value problem (11.7)-(11.8) has at most one solution.

The general solution of the linear differential equation (11.7) is given by

$$u = C_1 u_1 + C_2 u_2 + u^*, \tag{11.9}$$

where $u_1, u_2$ denotes a fundamental system of two linearly independent so-
lutions to the homogeneous differential equation, $u^*$ is a solution to the in-
homogeneous differential equation, and $C_1$ and $C_2$ are arbitrary constants.
This can be seen with the help of the Picard-Lindelöf Theorem 10.1 (see
Problem 11.4). The boundary condition (11.8) is satisfied, provided that
the constants $C_1$ and $C_2$ solve the linear system

$$C_1 u_1(a) + C_2 u_2(a) = -u^*(a),$$

$$C_1 u_1(b) + C_2 u_2(b) = -u^*(b).$$

This system is uniquely solvable. Assume that $C_1$ and $C_2$ solve the homoge-
neous system. Then $u = C_1 u_1 + C_2 u_2$ yields a solution to the homogeneous
boundary value problem. Hence $u = 0$, since we have already established
uniqueness for the boundary value problem. From this we conclude that
$C_1 = C_2 = 0$ because $u_1$ and $u_2$ are linearly independent, and the exis-
tence proof is complete. □


For the numerical solution, proceeding as in Example 2.1, we choose an
equidistant grid

$$x_j = a + jh, \quad j = 0, \ldots, n+1,$$

with the step size given by $h = (b-a)/(n+1)$ and $n \in \mathbb{N}$. At the internal
grid points $x_j$, $j = 1, \ldots, n$, we replace the differential quotient in the
differential equation by the difference quotient

$$u''(x_j) \approx \frac{1}{h^2} \left[ u(x_{j+1}) - 2u(x_j) + u(x_{j-1}) \right]$$

to obtain the system of equations

$$-\frac{1}{h^2} \left[ u_{j-1} - (2 + h^2 q_j) u_j + u_{j+1} \right] = r_j, \quad j = 1, \ldots, n, \tag{11.10}$$

for approximate values $u_j$ to the exact solution $u(x_j)$. Here we have set
$q_j := q(x_j)$ and $r_j := r(x_j)$. The system has to be complemented by the
two boundary conditions

$$u_0 = u_{n+1} = 0. \tag{11.11}$$

For an abbreviated notation we introduce the $n \times n$ tridiagonal matrix

$$A = \frac{1}{h^2} \begin{pmatrix}
2 + q_1 h^2 & -1 & & & \\
-1 & 2 + q_2 h^2 & -1 & & \\
 & -1 & 2 + q_3 h^2 & -1 & \\
 & & \ddots & \ddots & \ddots \\
 & & & -1 & 2 + q_n h^2
\end{pmatrix}$$

and the vectors $U = (u_1, \ldots, u_n)^T$ and $R = (r_1, \ldots, r_n)^T$. Then our system
of equations, including the boundary conditions, reads

$$AU = R. \tag{11.12}$$


The following two questions have to be answered:
1. Is the system (11.12) uniquely solvable?
2. How large is the error between the approximate solution $u_j$ and the
exact solution $u(x_j)$? Do we have convergence of the approximate
solution to the exact solution as $h \to 0$?


Theorem 11.5 For each h > 0 the difference equations (11.10)—(11.11) 
have a unique solution. 


Proof. The tridiagonal matrix A is irreducible and weakly row-diagonally 
dominant. Hence, by Theorem 4.7, the matrix A is invertible, and the Ja- 
cobi iterations converge. O 


Recall that for speeding up the convergence of the Jacobi iterations we 
can use relaxation methods or multigrid methods as discussed in Sections 
4.2 and 4.3. 
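
For the one-dimensional problem the tridiagonal system (11.12) can also be solved directly. The following sketch assembles (11.10), multiplied by $h^2$, and solves it by Gaussian elimination for tridiagonal matrices (the Thomas algorithm), which is a swapped-in direct solver rather than the Jacobi iteration mentioned above. The function name `solve_bvp_fd` and the test problem $q = 1$, $u(x) = \sin \pi x$ on $[0,1]$ are assumptions made for illustration.

```python
import math

def solve_bvp_fd(q, r, a, b, n):
    """Finite difference solution of -u'' + q u = r, u(a) = u(b) = 0,
    at the n interior grid points x_j = a + j h, h = (b - a)/(n + 1)."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(1, n + 1)]
    # Row j of (11.10) times h^2: -u_{j-1} + (2 + h^2 q_j) u_j - u_{j+1} = h^2 r_j
    diag = [2.0 + h * h * q(xj) for xj in x]
    rhs = [h * h * r(xj) for xj in x]
    # Forward elimination; all off-diagonal entries equal -1
    for j in range(1, n):
        m = -1.0 / diag[j - 1]
        diag[j] += m            # diag_j - m * (superdiagonal entry -1)
        rhs[j] -= m * rhs[j - 1]
    # Back substitution
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for j in range(n - 2, -1, -1):
        u[j] = (rhs[j] + u[j + 1]) / diag[j]
    return x, u

# Assumed test problem with known solution u(x) = sin(pi x)
x, u = solve_bvp_fd(lambda t: 1.0,
                    lambda t: (math.pi**2 + 1.0) * math.sin(math.pi * t),
                    0.0, 1.0, 49)
max_err = max(abs(uj - math.sin(math.pi * xj)) for xj, uj in zip(x, u))
print(max_err)
```

Elimination without pivoting is safe here because the matrix is diagonally dominant; the cost is proportional to $n$.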

The error and convergence analysis is initiated by first establishing the 
following two lemmas. 


Lemma 11.6 Denote by $A$ the matrix of the finite difference method for
$q \ge 0$ and by $A_0$ the corresponding matrix for $q = 0$. Then

$$0 \le A^{-1} \le A_0^{-1};$$

i.e., all components of $A^{-1}$ are nonnegative and smaller than or equal to
the corresponding components of $A_0^{-1}$.


Proof. The columns of the inverse $A^{-1} = (a_1, \ldots, a_n)$ satisfy $A a_j = e_j$ for
$j = 1, \ldots, n$ with the canonical unit vectors $e_1, \ldots, e_n$ in $\mathbb{R}^n$. The Jacobi
iterations for the solution of $Az = e_j$ starting with $z_0 = 0$ are given by

$$z_{\nu+1} = -D^{-1}(A_L + A_R)\, z_\nu + D^{-1} e_j, \quad \nu = 0, 1, \ldots,$$

with the usual splitting $A = D + A_L + A_R$ of $A$ into its diagonal, lower, and
upper triangular parts. Since the entries of $D^{-1}$ and of $-D^{-1}(A_L + A_R)$
are all nonnegative, it follows that $A^{-1} \ge 0$. Analogously, the iterations

$$z_{\nu+1} = -D_0^{-1}(A_L + A_R)\, z_\nu + D_0^{-1} e_j, \quad \nu = 0, 1, \ldots,$$

yield the columns of $A_0^{-1}$. Therefore, from $D_0^{-1} \ge D^{-1}$ we conclude that
$A_0^{-1} \ge A^{-1}$. □


Lemma 11.7 Assume that $u \in C^4[a, b]$. Then

$$\left| u''(x) - \frac{1}{h^2} \left[ u(x+h) - 2u(x) + u(x-h) \right] \right| \le \frac{h^2}{12}\, \| u^{(4)} \|_\infty$$

for all $x \in [a+h, b-h]$.


Proof. By Taylor's formula we have that

$$u(x \pm h) = u(x) \pm h u'(x) + \frac{h^2}{2}\, u''(x) \pm \frac{h^3}{6}\, u'''(x) + \frac{h^4}{24}\, u^{(4)}(x \pm \theta_\pm h)$$

for some $\theta_\pm \in (0,1)$. Adding these two equations gives

$$u(x+h) - 2u(x) + u(x-h) = h^2 u''(x) + \frac{h^4}{24}\, u^{(4)}(x + \theta_+ h) + \frac{h^4}{24}\, u^{(4)}(x - \theta_- h),$$

whence the statement of the lemma follows. □


Theorem 11.8 Assume that the solution to the boundary value problem
(11.7)-(11.8) is four-times continuously differentiable. Then the error of
the finite difference approximation can be estimated by

$$|u(x_j) - u_j| \le \frac{h^2}{96}\, \| u^{(4)} \|_\infty\, (b-a)^2, \quad j = 1, \ldots, n.$$

Proof. By Lemma 11.7, for

$$z_j := u''(x_j) - \frac{1}{h^2} \left[ u(x_{j+1}) - 2u(x_j) + u(x_{j-1}) \right]$$

we have the estimate

$$|z_j| \le \frac{h^2}{12}\, \| u^{(4)} \|_\infty, \quad j = 1, \ldots, n. \tag{11.13}$$

Since

$$-\frac{1}{h^2} \left[ u(x_{j+1}) - (2 + h^2 q_j)\, u(x_j) + u(x_{j-1}) \right] = -u''(x_j) + q_j u(x_j) + z_j = r_j + z_j,$$

the vector $\tilde{U} = (u(x_1), \ldots, u(x_n))^T$ given by the exact solution solves the
linear system

$$A\tilde{U} = R + Z,$$

where $Z = (z_1, \ldots, z_n)^T$. Therefore,

$$A(\tilde{U} - U) = Z,$$

and from this, using Lemma 11.6 and the estimate (11.13), we obtain

$$|u(x_j) - u_j| \le \| A^{-1} Z \|_\infty \le \frac{h^2}{12}\, \| u^{(4)} \|_\infty\, \| A_0^{-1} e \|_\infty, \quad j = 1, \ldots, n, \tag{11.14}$$

where $e = (1, \ldots, 1)^T$. The boundary value problem

$$-u_0'' = 1, \quad u_0(a) = u_0(b) = 0,$$

has the solution

$$u_0(x) = \frac{1}{2}\, (x - a)(b - x).$$

Since $u_0^{(4)} = 0$, in this case, as a consequence of (11.14) the finite difference
approximation coincides with the exact solution; i.e., $e = A_0 U_0 = A_0 \tilde{U}_0$.
Hence,

$$\| A_0^{-1} e \|_\infty \le \| u_0 \|_\infty = \frac{1}{8}\, (b-a)^2.$$

Inserting this into (11.14) completes the proof. □


Theorem 11.8 confirms that as in the case of the initial value problems 
in Chapter 10, the order of the local discretization error is inherited by 
the global error. Note that the assumption in Theorem 11.8 on the dif-
ferentiability of the solution is satisfied if $q$ and $r$ are twice continuously
differentiable.

The error estimate in Theorem 11.8 is not practical in general, since it 
requires a bound on the fourth derivative of the unknown exact solution. 
Therefore, in practice, analogously to (10.19) the error is estimated from 
the numerical results for step sizes h and h/2. Similarly, as in (10.20), a 
Richardson extrapolation can be employed to obtain a fourth-order approx- 
imation. 
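
The step-halving idea can be sketched as follows, under the assumption that the error at a grid point behaves like $C h^2$ plus higher-order terms: then $(u_{h/2} - u_h)/3$ estimates the error $u - u_{h/2}$, and adding this estimate to $u_{h/2}$ gives the extrapolated value $(4u_{h/2} - u_h)/3$. The helper names `solve_bvp_fd` and `richardson` and the test problem $q = 1$, $u(x) = \sin \pi x$ on $[0,1]$ are hypothetical choices for illustration; the tridiagonal system is solved by elimination.

```python
import math

def solve_bvp_fd(q, r, a, b, n):
    """Finite difference values for -u'' + q u = r, u(a) = u(b) = 0, at n interior points."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(1, n + 1)]
    diag = [2.0 + h * h * q(xj) for xj in x]
    rhs = [h * h * r(xj) for xj in x]
    for j in range(1, n):                      # forward elimination, off-diagonals are -1
        m = -1.0 / diag[j - 1]
        diag[j] += m
        rhs[j] -= m * rhs[j - 1]
    u = [0.0] * n
    u[-1] = rhs[-1] / diag[-1]
    for j in range(n - 2, -1, -1):             # back substitution
        u[j] = (rhs[j] + u[j + 1]) / diag[j]
    return u

def richardson(q, r, a, b, n):
    """Compare step sizes h and h/2 on the coarse grid points."""
    coarse = solve_bvp_fd(q, r, a, b, n)            # step size h
    fine = solve_bvp_fd(q, r, a, b, 2 * n + 1)      # step size h/2
    fine_on_coarse = [fine[2 * i + 1] for i in range(n)]
    estimate = [(f - c) / 3.0 for f, c in zip(fine_on_coarse, coarse)]
    extrapolated = [f + e for f, e in zip(fine_on_coarse, estimate)]
    return fine_on_coarse, estimate, extrapolated

n = 9
q = lambda t: 1.0
r = lambda t: (math.pi**2 + 1.0) * math.sin(math.pi * t)
exact = [math.sin(math.pi * (i + 1) / (n + 1)) for i in range(n)]
fine_vals, estimate, extrapolated = richardson(q, r, 0.0, 1.0, n)

err_fine = max(abs(e - f) for e, f in zip(exact, fine_vals))
err_estimated = max(abs(e) for e in estimate)
err_extrap = max(abs(e - x) for e, x in zip(exact, extrapolated))
print(err_fine, err_estimated, err_extrap)
```

In this example the estimated error should be close to the true error of the fine-grid solution, and the extrapolated values should be markedly more accurate than either grid solution.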

Of course, the finite difference approximation can be extended to the
general linear ordinary differential equation of second order

$$-u'' + pu' + qu = r$$

by using the approximation

$$u'(x_j) \approx \frac{1}{2h} \left[ u(x_{j+1}) - u(x_{j-1}) \right] \tag{11.15}$$

for the first derivative. This approximation again has an error of order
$O(h^2)$ (see Problem 11.9). Besides Richardson extrapolation, higher-order
approximations can be obtained by using higher-order difference approxi-
mations for the derivatives such as

$$u''(x) \approx \frac{1}{12h^2} \left[ -u(x+2h) + 16u(x+h) - 30u(x) + 16u(x-h) - u(x-2h) \right], \tag{11.16}$$

which is of order $O(h^4)$, provided that $u$ is six-times continuously differen-
tiable (see Problem 11.9).
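
The orders of the three-point and five-point formulas can be observed numerically. The sketch below applies both difference quotients to the assumed test function $u(x) = \sin x$ at $x = 1$ and compares the errors for $h = 0.1$ and $h = 0.05$; halving $h$ should reduce the error by a factor near 4 for the second-order formula and near 16 for (11.16). The function names are hypothetical.

```python
import math

def d2_three_point(u, x, h):
    """Second-order central difference quotient for u''(x)."""
    return (u(x + h) - 2.0 * u(x) + u(x - h)) / h**2

def d2_five_point(u, x, h):
    """Fourth-order difference quotient (11.16) for u''(x)."""
    return (-u(x + 2.0 * h) + 16.0 * u(x + h) - 30.0 * u(x)
            + 16.0 * u(x - h) - u(x - 2.0 * h)) / (12.0 * h**2)

x0 = 1.0
exact = -math.sin(x0)      # second derivative of sin at x0

def err2(h):
    return abs(d2_three_point(math.sin, x0, h) - exact)

def err4(h):
    return abs(d2_five_point(math.sin, x0, h) - exact)

ratio2 = err2(0.1) / err2(0.05)
ratio4 = err4(0.1) / err4(0.05)
print(ratio2, ratio4)
```

For very small $h$ the observed ratios deteriorate because rounding errors are amplified by the factor $1/h^2$.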
We wish also to indicate briefly how the finite difference approximations
are applied to boundary value problems for partial differential equations.
For this we consider the boundary value problem for

$$-\Delta u + qu = r \quad \text{in } D \tag{11.17}$$

in the unit square $D = (0,1) \times (0,1)$ with boundary condition

$$u = 0 \quad \text{on } \partial D. \tag{11.18}$$

Here $\Delta$ denotes the Laplacian

$$\Delta u := \frac{\partial^2 u}{\partial x_1^2} + \frac{\partial^2 u}{\partial x_2^2}.$$

Proceeding as in the proof of Theorem 11.4, by partial integration it can
be seen that under the assumption $q \ge 0$ this boundary value problem
has at most one solution. It is more involved and beyond the scope of this
book to establish that a solution exists under proper assumptions on the
functions $q$ and $r$. We refer to [24, 60] and also the remarks at the end of
Section 11.4.

As in Example 2.2, we choose an equidistant grid

$$x_{ij} = (ih, jh), \quad i, j = 0, \ldots, n+1,$$

with step size $h = 1/(n+1)$ and $n \in \mathbb{N}$. Then we approximate the Laplacian
at the internal grid points by

$$\Delta u(x_{ij}) \approx \frac{1}{h^2} \left\{ u(x_{i+1,j}) + u(x_{i-1,j}) + u(x_{i,j+1}) + u(x_{i,j-1}) - 4u(x_{ij}) \right\}$$

and obtain the system of equations

$$\frac{1}{h^2} \left[ (4 + h^2 q_{ij})\, u_{ij} - u_{i+1,j} - u_{i-1,j} - u_{i,j+1} - u_{i,j-1} \right] = r_{ij}, \quad i, j = 1, \ldots, n, \tag{11.19}$$

for approximate values $u_{ij}$ to the exact solution $u(x_{ij})$. Here we have set
$q_{ij} := q(x_{ij})$ and $r_{ij} := r(x_{ij})$. This system has to be complemented by the
boundary conditions

$$\begin{aligned}
u_{0,j} = u_{n+1,j} &= 0, & j &= 0, \ldots, n+1, \\
u_{i,0} = u_{i,n+1} &= 0, & i &= 1, \ldots, n.
\end{aligned} \tag{11.20}$$

We refrain from rewriting the system (11.19)-(11.20) in matrix notation
and refer back to Example 2.2. Analogously to Theorem 11.5, it can be seen
that the Jacobi iterations converge (and relaxation methods and multigrid
methods are applicable). Hence we have the following theorem.


Theorem 11.9 For each $h > 0$ the difference equations (11.19)-(11.20)
have a unique solution.


From the proof of Lemma 11.6 it can be seen that its statement also holds
for the corresponding matrices of the system (11.19)-(11.20). Lemma 11.7
implies that

$$\left| \Delta u(x_1, x_2) - \frac{1}{h^2} \left[ u(x_1 + h, x_2) + u(x_1 - h, x_2) + u(x_1, x_2 + h) + u(x_1, x_2 - h) - 4u(x_1, x_2) \right] \right| \le \frac{h^2}{12} \left( \left\| \frac{\partial^4 u}{\partial x_1^4} \right\|_\infty + \left\| \frac{\partial^4 u}{\partial x_2^4} \right\|_\infty \right),$$

provided that $u \in C^4([0, 1] \times [0,1])$. Then we can proceed as in the proof
of Theorem 11.8 to derive an error estimate. For this we need to have an
estimate on the solution of

$$-\Delta u_0 = 1 \quad \text{in } D, \qquad u_0 = 0 \quad \text{on } \partial D. \tag{11.21}$$

Either from an explicit form of the solution obtained by separation of vari-
ables or by writing

$$u_0(x) = \frac{1}{4}\, (1 - x_1)\, x_1 + \frac{1}{4}\, (1 - x_2)\, x_2 + v_0(x),$$

where $v_0$ is a harmonic function, i.e., a solution of $\Delta v_0 = 0$, and employing
the maximum-minimum principle for harmonic functions (see [39]), it can
be seen that $\| u_0 \|_\infty \le 1/8$ (see Problem 11.10). Hence we can state the
following theorem.


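Theorem 11.10 Assume that the solution to the boundary value problem
(11.17)-(11.18) is four-times continuously differentiable. Then the error of
the finite difference approximation can be estimated by

$$|u(x_{ij}) - u_{ij}| \le \frac{h^2}{96} \left( \left\| \frac{\partial^4 u}{\partial x_1^4} \right\|_\infty + \left\| \frac{\partial^4 u}{\partial x_2^4} \right\|_\infty \right), \quad i, j = 1, \ldots, n.$$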
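
A minimal sketch of the Jacobi iteration for the five-point system (11.19)-(11.20) is given below. Each sweep solves the equation at grid point $(i,j)$ for $u_{ij}$ using the neighboring values of the previous iterate. The function name `solve_poisson_jacobi`, the stopping rule based on the largest change per sweep, and the test problem $q = 0$, $u(x) = \sin \pi x_1 \sin \pi x_2$ are assumptions made for illustration; for this test function both fourth derivatives have maximum norm $\pi^4$, so the bound of Theorem 11.10 is $h^2 \pi^4 / 48$.

```python
import math

def solve_poisson_jacobi(q, r, n, tol=1e-12, max_sweeps=100000):
    """Jacobi iterations for -Laplace(u) + q u = r on the unit square, u = 0 on the boundary."""
    h = 1.0 / (n + 1)
    u = [[0.0] * (n + 2) for _ in range(n + 2)]     # boundary rows and columns stay zero
    for _ in range(max_sweeps):
        new = [row[:] for row in u]
        change = 0.0
        for i in range(1, n + 1):
            for j in range(1, n + 1):
                x1, x2 = i * h, j * h
                # Equation (11.19) solved for u_ij with old neighbor values
                new[i][j] = (h * h * r(x1, x2) + u[i + 1][j] + u[i - 1][j]
                             + u[i][j + 1] + u[i][j - 1]) / (4.0 + h * h * q(x1, x2))
                change = max(change, abs(new[i][j] - u[i][j]))
        u = new
        if change < tol:
            break
    return h, u

n = 7
h, u = solve_poisson_jacobi(
    lambda x1, x2: 0.0,
    lambda x1, x2: 2.0 * math.pi**2 * math.sin(math.pi * x1) * math.sin(math.pi * x2),
    n)
max_err = max(abs(u[i][j] - math.sin(math.pi * i * h) * math.sin(math.pi * j * h))
              for i in range(1, n + 1) for j in range(1, n + 1))
bound = h * h * math.pi**4 / 48.0
print(max_err, bound)
```

The plain Jacobi iteration converges slowly as $h$ decreases, which is the motivation for the relaxation and multigrid methods mentioned in the text.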


11.3 The Riesz and Lax—Milgram Theorems 


To establish the foundation of finite element methods for boundary value 
problems we need to extend our tools from functional analysis. 


Theorem 11.11 (Riesz) Let $X$ be a Hilbert space. Then for each bounded
linear function $F : X \to \mathbb{C}$ there exists a unique element $f \in X$ such that

$$F(u) = (u, f) \tag{11.22}$$

for all $u \in X$. The norms of the element $f$ and the linear function $F$
coincide; i.e.,

$$\| f \| = \| F \|. \tag{11.23}$$


Proof. Uniqueness follows from the observation that because of the positive
definiteness of the scalar product, $f = 0$ is the only element representing
the zero function $F = 0$ in the sense of (11.22). For $F \ne 0$ choose $w \in X$
with $F(w) \ne 0$. Since $F$ is continuous, the nullspace

$$N(F) = \{ u \in X : F(u) = 0 \}$$

can be seen to be a closed, and consequently, by Remark 3.40, a complete,
subspace of the Hilbert space $X$. By the approximation Theorem 3.52 there
exists the best approximation $v$ to $w$ with respect to $N(F)$. By Theorem
3.51 it satisfies $w - v \perp N(F)$. Then for $g := w - v$ we have that

$$(F(g)u - F(u)g,\, g) = 0, \quad u \in X,$$

since $F(g)u - F(u)g \in N(F)$ for all $u \in X$. Hence,

$$F(u) = \left( u,\, \frac{\overline{F(g)}\, g}{\| g \|^2} \right)$$

for all $u \in X$, which completes the proof of (11.22).
From (11.22) and the Cauchy-Schwarz inequality we have that

$$|F(u)| \le \| f \|\, \| u \|, \quad u \in X,$$

whence $\| F \| \le \| f \|$ follows. On the other hand, inserting $f$ into (11.22)
yields

$$\| f \|^2 = F(f) \le \| F \|\, \| f \|,$$

and therefore $\| f \| \le \| F \|$. This concludes the proof of the norm equality
(11.23). □


Definition 11.12 A linear operator $A : X \to X$ in a pre-Hilbert space $X$
is called strictly coercive if there exists a constant $c > 0$ such that

$$\operatorname{Re}(Au, u) \ge c\, \| u \|^2 \tag{11.24}$$

for all $u \in X$.


Theorem 11.13 (Lax-Milgram) In a Hilbert space $X$ a bounded and
strictly coercive linear operator $A : X \to X$ has a bounded inverse
$A^{-1} : X \to X$.


Proof. Using the Cauchy-Schwarz inequality, we can estimate

$$\| Au \|\, \| u \| \ge \operatorname{Re}(Au, u) \ge c\, \| u \|^2.$$

Hence

$$\| Au \| \ge c\, \| u \| \tag{11.25}$$

for all $u \in X$. From (11.25) we observe that $Au = 0$ implies $u = 0$; i.e., $A$
is injective.

Next we show that the range $A(X)$ is closed. Let $v$ be an element of the
closure $\overline{A(X)}$ and let $(v_n)$ be a sequence from $A(X)$ with $v_n \to v$, $n \to \infty$.
Then we can write $v_n = Au_n$ with some $u_n \in X$, and from (11.25) we find
that

$$c\, \| u_n - u_m \| \le \| v_n - v_m \|$$

for all $n, m \in \mathbb{N}$. Therefore, $(u_n)$ is a Cauchy sequence in $X$ and converges:
$u_n \to u$, $n \to \infty$, with some $u \in X$. Then $v = Au$, since $A$ is continuous,
and $\overline{A(X)} = A(X)$ is proven.

From Remark 3.40 we now have that $A(X)$ is complete. Let $w \in X$ be
arbitrary and denote by $v$ its best approximation with respect to $A(X)$,
which uniquely exists by Theorem 3.52. Then, by Theorem 3.51, we have
$(w - v, u) = 0$ for all $u \in A(X)$. In particular, $(w - v, A(w - v)) = 0$. Hence,
from (11.24) we see that $w = v \in A(X)$. Therefore, $A$ is surjective. Finally,
the boundedness of the inverse

$$\| A^{-1} \| \le \frac{1}{c} \tag{11.26}$$

is a consequence of (11.25). □


Definition 11.14 Let $X$ be a complex (or real) linear space. Then a func-
tion $S : X \times X \to \mathbb{C}$ (or $\mathbb{R}$) is called sesquilinear if it is linear with respect
to the first variable and antilinear with respect to the second variable, i.e.,

$$S(\alpha u + \beta v, w) = \alpha S(u, w) + \beta S(v, w)$$

and

$$S(u, \alpha v + \beta w) = \bar{\alpha}\, S(u, v) + \bar{\beta}\, S(u, w)$$

for all $u, v, w \in X$ and $\alpha, \beta \in \mathbb{C}$ (or $\mathbb{R}$). A sesquilinear function on a
normed space $X$ is called bounded if

$$|S(u, v)| \le C\, \| u \|\, \| v \|$$

for all $u, v \in X$ and some positive constant $C$. It is called strictly coercive
if

$$\operatorname{Re} S(u, u) \ge c\, \| u \|^2$$

for all $u \in X$ and some positive constant $c$.


Note that for a real linear space, sesquilinear functions are bilinear func-
tions, i.e., linear with respect to both variables. Each bounded and strictly
coercive linear operator $A : X \to X$ in a pre-Hilbert space defines a
bounded and strictly coercive sesquilinear function by

$$S(u, v) := (u, Av), \quad u, v \in X.$$

The converse of this statement is described by the following theorem.


Theorem 11.15 Let $S$ be a bounded and strictly coercive sesquilinear func-
tion on a Hilbert space $X$. Then there exists a uniquely determined bounded
and strictly coercive linear operator $A : X \to X$ such that

$$S(u, v) = (u, Av)$$

for all $u, v \in X$.


Proof. For each $v \in X$ the mapping $u \mapsto S(u, v)$ clearly defines a bounded
linear function on $X$, since $|S(u, v)| \le C\, \| u \|\, \| v \|$. By the Riesz Theorem
11.11 we can write $S(u, v) = (u, f)$ for all $u \in X$ and some $f \in X$.
Therefore, setting $Av := f$ we define an operator $A : X \to X$ such that
$S(u, v) = (u, Av)$ for all $u, v \in X$.

To show that $A$ is linear we observe that

$$(u, \alpha Av + \beta Aw) = \bar{\alpha}\, (u, Av) + \bar{\beta}\, (u, Aw) = \bar{\alpha}\, S(u, v) + \bar{\beta}\, S(u, w) = S(u, \alpha v + \beta w) = (u, A(\alpha v + \beta w))$$

for all $u, v, w \in X$ and all $\alpha, \beta \in \mathbb{C}$. The boundedness of $A$ follows from

$$\| Au \|^2 = (Au, Au) = S(Au, u) \le C\, \| Au \|\, \| u \|,$$

and the strict coercivity of $A$ is a consequence of the strict coercivity of $S$.
To show uniqueness of the operator $A$ we suppose that there exist two
operators $A_1$ and $A_2$ with the property

$$S(u, v) = (u, A_1 v) = (u, A_2 v)$$

for all $u, v \in X$. Then we have $(u, A_1 v - A_2 v) = 0$ for all $u, v \in X$, which
implies $A_1 v = A_2 v$ for all $v \in X$ by setting $u = A_1 v - A_2 v$. □


Corollary 11.16 Let $S$ be a bounded and strictly coercive sesquilinear
function and $F$ a bounded linear function on a Hilbert space $X$. Then there
exists a unique $u \in X$ such that

$$S(v, u) = F(v) \tag{11.27}$$

for all $v \in X$.


Proof. By Theorem 11.15 there exists a uniquely determined bounded and
strictly coercive linear operator $A$ such that

$$S(v, u) = (v, Au)$$

for all $u, v \in X$, and by Theorem 11.11 there exists a uniquely determined
element $f$ such that

$$F(v) = (v, f)$$

for all $v \in X$. Hence, the equation (11.27) is equivalent to the equation

$$Au = f.$$

However, the latter equation is uniquely solvable as a consequence of the
Lax-Milgram Theorem 11.13. □


Since the coercivity constants for $A$ and $S$ coincide, from (11.23) and
(11.26) we conclude that

$$\| u \| \le \frac{1}{c}\, \| F \| \tag{11.28}$$

for the unique solution $u$ of (11.27).

Let $A : X \to X$ be a bounded linear operator. Then, given $f \in X$,
solving the equation $Au = f$ obviously is equivalent to finding $u \in X$ such
that

$$(v, Au) = (v, f) \tag{11.29}$$

for all $v \in X$. The Galerkin method, named after the Russian engineer
Galerkin, is based on this observation, and given a finite-dimensional sub-
space $X_n \subset X$, it approximately solves (11.29) by an element $u_n \in X_n$
such that

$$(v, Au_n) = (v, f) \tag{11.30}$$

for all $v \in X_n$. By Theorems 3.51 and 3.52, the condition (11.30) is equiva-
lent to the fact that the best approximations to $Au_n$ and to $f$ with respect
to $X_n$ coincide; i.e.,

$$P_n A u_n = P_n f, \tag{11.31}$$

where $P_n$ denotes the orthogonal projection operator from $X$ onto $X_n$.
The equivalence of (11.30) and (11.31) is the reason why the Galerkin
method belongs to the so-called projection methods; i.e., the equation to be
approximated is projected onto a finite-dimensional subspace.

To analyze the Galerkin method we introduce a finite-dimensional oper-
ator $A_n : X_n \to X_n$ by $A_n := P_n A$. Then, by Theorem 3.51, we have

$$(A_n u, u) = (P_n A u, u) = (Au, u) + (P_n A u - Au, u) = (Au, u)$$

for all $u \in X_n$. Hence from the strict coercivity of $A$ we deduce that

$$\operatorname{Re}(A_n u, u) \ge c\, \| u \|^2$$

for all $u \in X_n$; i.e., $A_n : X_n \to X_n$ is strictly coercive with the same
coercivity constant $c$ as $A : X \to X$. This now can be employed to prove
the following theorem.


Theorem 11.17 For a bounded and strictly coercive linear operator $A$ the
Galerkin equations (11.30) have a unique solution. It satisfies the error
estimate

$$\| u_n - u \| \le M \inf_{v \in X_n} \| v - u \|, \tag{11.32}$$

where $M$ is some constant depending on $A$ (and not on $X_n$).


Proof. Since $A_n : X_n \to X_n$ is strictly coercive with coercivity constant
$c$, by the Lax-Milgram Theorem 11.13 we conclude that $A_n$ is bijective;
i.e., the Galerkin equations (11.30) have a unique solution $u_n \in X_n$. The
estimate (11.26) applied to the operator $A_n$ implies that

$$\| A_n^{-1} \| \le \frac{1}{c}. \tag{11.33}$$

For the error $u_n - u$ between the Galerkin approximation $u_n$ and the
exact solution $u$ we can write

$$u_n - u = (A_n^{-1} P_n A - I)\, u = (A_n^{-1} P_n A - I)(u - v)$$

for all $v \in X_n$, since, trivially, we have $A_n^{-1} P_n A v = v$ for $v \in X_n$. By
Theorem 3.52 we have $\| P_n \| = 1$, and therefore, using Remark 3.25 and
(11.33) we can estimate

$$\| A_n^{-1} P_n A \| \le \frac{1}{c}\, \| A \|,$$

whence (11.32) follows, for instance with $M = 1 + \| A \| / c$. □


The error estimate of Theorem 11.17 is usually referred to as Céa's
lemma, since it was first obtained by Céa in 1964. It indicates that the
error in the Galerkin method is determined by how well the exact solution
can be approximated by elements of the subspace $X_n$.

By Corollary 11.16 the Galerkin method immediately carries over to the
solution of the sesquilinear equation (11.27) and consists in finding
$u_n \in X_n$ such that

$$S(v, u_n) = F(v) \qquad (11.34)$$

for all $v \in X_n$.

The practical solution of the Galerkin equations (11.30) reduces to the
solution of a system of linear equations. If $w_1, \ldots, w_n$ is a basis
for $X_n$ (without loss of generality we assume the dimension of $X_n$ to be
$n$), then for

$$u_n = \sum_{k=1}^{n} \alpha_k w_k$$

the Galerkin equations (11.30) are equivalent to the system of linear
equations

$$\sum_{k=1}^{n} \alpha_k (w_j, A w_k) = (w_j, f), \quad j = 1, \ldots, n. \qquad (11.35)$$

From this formulation it becomes obvious that the Galerkin method is
only a semidiscrete method, since setting up the linear system requires
the evaluation of scalar products and of the operator $A$ applied to the
basis elements. For a fully discrete method these computations, in general,
need further approximations of integrals for the scalar products and of
differential or integral operators. This also requires that the error analysis
be amended accordingly, since the error estimate of Theorem 11.17 covers
only the semidiscrete case.
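
To make the structure of the linear system (11.35) concrete, the following
Python sketch sets it up for a toy problem. Everything in it is an
assumption made for illustration and is not taken from the text: the space
$X$ is $\mathbb{R}^4$ with the Euclidean scalar product, the operator $A$ is
a symmetric positive definite matrix, and the two basis vectors of $X_n$ as
well as the helper names `dot`, `matvec`, `solve`, and `galerkin` are
hypothetical.

```python
def dot(u, v):
    """Euclidean scalar product (u, v)."""
    return sum(ui * vi for ui, vi in zip(u, v))


def matvec(A, u):
    """Apply the matrix A to the vector u."""
    return [dot(row, u) for row in A]


def solve(G, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(G, b)]  # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def galerkin(A, f, basis):
    """Set up and solve sum_k alpha_k (w_j, A w_k) = (w_j, f), j = 1..n."""
    G = [[dot(wj, matvec(A, wk)) for wk in basis] for wj in basis]
    rhs = [dot(wj, f) for wj in basis]
    alpha = solve(G, rhs)
    u_n = [sum(a * w[i] for a, w in zip(alpha, basis)) for i in range(len(f))]
    return alpha, u_n


# Toy data (assumed): a coercive operator on R^4 and a two-dimensional subspace.
A = [[2.0, -1.0, 0.0, 0.0],
     [-1.0, 2.0, -1.0, 0.0],
     [0.0, -1.0, 2.0, -1.0],
     [0.0, 0.0, -1.0, 2.0]]
f = [1.0, 2.0, 3.0, 4.0]
basis = [[1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, -1.0]]

alpha, u_n = galerkin(A, f, basis)
residual = [r - fi for r, fi in zip(matvec(A, u_n), f)]
# The residual A u_n - f is orthogonal to every basis element of X_n.
galerkin_defects = [dot(w, residual) for w in basis]
print(alpha, galerkin_defects)
```

The residual $Au_n - f$ does not vanish here, but its scalar products with
the basis elements do, which is exactly the condition (11.30).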

Having outlined the basic ideas of the Galerkin method and its error 
analysis within a few paragraphs, we want to point out clearly that the 
power and the art of the application of the Galerkin method for the ap- 
proximate solution of differential and integral equations begins with the 
proper choice of the approximating subspace $X_n$ and the appropriate basis
$w_1, \ldots, w_n$ therein, corresponding to the operator $A$ under consideration.
However, it is beyond our goal to enter into this important topic in any 
detail aside from the short discussion in Section 11.5. 


11.4 Weak Solutions 


We return to the boundary value problem, and instead of (11.7)–(11.8) we
consider the slightly more general so-called Sturm–Liouville problem

$$-(pu')' + qu = r \quad \text{in } [a,b] \qquad (11.36)$$

with homogeneous boundary conditions

$$u(a) = u(b) = 0. \qquad (11.37)$$

Here we assume that $p \in C^1[a,b]$ and $q, r \in C[a,b]$ such that
$p(x) > 0$ and $q(x) \ge 0$ for all $x \in [a,b]$. Multiplying the
differential equation by $v$ and performing a partial integration, it follows
that each solution $u$ to (11.36)–(11.37) satisfies

$$S(v,u) = F(v) \qquad (11.38)$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$, where we have set

$$S(u,v) := \int_a^b (pu'v' + quv)\,dx \qquad (11.39)$$

and

$$F(v) := \int_a^b rv\,dx. \qquad (11.40)$$

Conversely, if $u \in C^2[a,b]$ satisfies (11.38), by partial integration we
obtain that

$$\int_a^b [(pu')' - qu + r]\,v\,dx = 0 \qquad (11.41)$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$. Now we set
$f := (pu')' - qu + r$ and assume that $f(x_0) \ne 0$ for some $x_0$ in
$(a,b)$, say $f(x_0) > 0$. Since $f$ is continuous, there exists an interval
$U \subset (a,b)$ such that $f$ is positive on $U$. Now we choose a
nonnegative function $v \ne 0$ from $C^1[a,b]$ which vanishes outside $U$.
For this function $v$ the integral in (11.41) must be positive. This
is a contradiction, and therefore $f$ must vanish identically; i.e., $u$
satisfies the differential equation (11.36). Therefore, (11.38) provides an
equivalent reformulation of the boundary value problem.

From Example 3.38 we recall that the space of continuous functions is
not complete with respect to the $L^2$ scalar product. However, if we wish
to apply the analysis of the previous section and, in particular, Corollary
11.16, then we need a Hilbert space. For this, we introduce the Sobolev space
$H^1[a,b]$ based on the concept of weak derivatives. By $L^2[a,b]$ we denote
the space of measurable real-valued functions defined on the interval $[a,b]$
that are square-integrable in the sense of Lebesgue. We shall make use of the
fact that $L^2[a,b]$ is a Hilbert space with respect to the $L^2$ scalar
product. (More precisely, $L^2[a,b]$ is the linear space of equivalence
classes of functions coinciding almost everywhere.) Note that the space
$C[a,b]$ of continuous functions is dense in $L^2[a,b]$ (see [5, 51, 59]).


Definition 11.18 A function $u \in L^2[a,b]$ is said to have a weak
derivative $u' \in L^2[a,b]$ if

$$\int_a^b u v'\,dx = -\int_a^b u' v\,dx \qquad (11.42)$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$.


By partial integration it follows that (11.42) is satisfied for functions
$u \in C^1[a,b]$. Hence, weak differentiability generalizes classical
differentiability.
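
As a numerical illustration of Definition 11.18, the following Python
sketch checks (11.42) for a function that is not classically
differentiable. The concrete choices are assumptions made for the example
and do not come from the text: $u(x) = |x - 1/2|$ on $[0,1]$ with the
piecewise constant candidate derivative $u'(x) = \pm 1$, the test function
$v(x) = \sin 2\pi x$, and a composite midpoint rule (the helper `midpoint`
is hypothetical) in place of exact integration. For this particular pair
both sides of (11.42) equal $2/\pi$.

```python
import math


def midpoint(g, a, b, n):
    """Composite midpoint rule with n subintervals on [a, b]."""
    h = (b - a) / n
    return h * sum(g(a + (i + 0.5) * h) for i in range(n))


def u(x):
    return abs(x - 0.5)          # continuous, with a kink at x = 1/2


def du(x):
    return -1.0 if x < 0.5 else 1.0   # candidate weak derivative


def v(x):
    return math.sin(2.0 * math.pi * x)   # test function, v(0) = v(1) = 0


def dv(x):
    return 2.0 * math.pi * math.cos(2.0 * math.pi * x)


n = 4000  # even, so that the kink at x = 1/2 is a grid point
lhs = midpoint(lambda x: u(x) * dv(x), 0.0, 1.0, n)    # integral of u v'
rhs = -midpoint(lambda x: du(x) * v(x), 0.0, 1.0, n)   # minus integral of u' v
print(lhs, rhs, 2.0 / math.pi)
```

A single test function of course does not prove weak differentiability;
the sketch only shows what the defining identity looks like in a concrete
case.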

From the denseness of $\{v \in C^1[a,b] : v(a) = v(b) = 0\}$ in $L^2[a,b]$,
or from the Fourier series for the odd extension of $u$, it can be seen that
the weak derivative, if it exists, is unique (see Problem 11.17). From the
denseness of $C[a,b]$ in $L^2[a,b]$, or from the Fourier series for the even
extension of $u$, it follows that each function with vanishing weak
derivative must be constant almost everywhere (see Problem 11.17). The
latter, in particular, implies

$$u(x) = \int_a^x u'(\xi)\,d\xi + c \qquad (11.43)$$

for almost all $x \in [a,b]$ and some constant $c$, since by Fubini's theorem

$$\int_a^b \left( \int_a^x u'(\xi)\,d\xi \right) v'(x)\,dx = \int_a^b u'(\xi) \left( \int_\xi^b v'(x)\,dx \right) d\xi = -\int_a^b u'(\xi) v(\xi)\,d\xi$$

for all $v \in C^1[a,b]$ with $v(a) = v(b) = 0$. Hence both sides of (11.43)
have the same weak derivative.


Theorem 11.19 The linear space

$$H^1[a,b] := \{u \in L^2[a,b] : u' \in L^2[a,b]\}$$

endowed with the scalar product

$$(u,v)_{H^1} := \int_a^b (uv + u'v')\,dx \qquad (11.44)$$

is a Hilbert space.


Proof. It is readily checked that $H^1[a,b]$ is a linear space and that
(11.44) defines a scalar product. Let $(u_n)$ denote an $H^1$ Cauchy
sequence. Then $(u_n)$ and $(u_n')$ are both $L^2$ Cauchy sequences. From the
completeness of $L^2[a,b]$ we obtain the existence of $u \in L^2[a,b]$ and
$w \in L^2[a,b]$ such that $\|u_n - u\|_{L^2} \to 0$ and
$\|u_n' - w\|_{L^2} \to 0$ as $n \to \infty$. Then for all $v \in C^1[a,b]$
with $v(a) = v(b) = 0$ we can estimate

$$\int_a^b (uv' + wv)\,dx = \int_a^b \{(u - u_n)v' + (w - u_n')v\}\,dx \le \|u - u_n\|_{L^2}\|v'\|_{L^2} + \|w - u_n'\|_{L^2}\|v\|_{L^2} \to 0, \quad n \to \infty.$$

Therefore, $u \in H^1[a,b]$ with $u' = w$, and $\|u - u_n\|_{H^1} \to 0$,
$n \to \infty$, which completes the proof. □


Theorem 11.20 $C^1[a,b]$ is dense in $H^1[a,b]$.


Proof. Since $C[a,b]$ is dense in $L^2[a,b]$, for each $u \in H^1[a,b]$ and
$\varepsilon > 0$ there exists $w \in C[a,b]$ such that
$\|u' - w\|_{L^2} < \varepsilon$. Then we define $v \in C^1[a,b]$ by

$$v(x) := u(a) + \int_a^x w(\xi)\,d\xi,$$

and using (11.43), we have

$$u(x) - v(x) = \int_a^x \{u'(\xi) - w(\xi)\}\,d\xi.$$

By the Cauchy–Schwarz inequality this implies
$\|u - v\|_{L^2} < (b - a)\varepsilon$, and the proof is complete. □


Theorem 11.21 $H^1[a,b]$ is contained in $C[a,b]$.

Proof. From (11.43) we have

$$u(x) - u(y) = \int_y^x u'(\xi)\,d\xi, \qquad (11.45)$$

whence by the Cauchy–Schwarz inequality,

$$|u(x) - u(y)| \le |x - y|^{1/2}\,\|u'\|_{L^2}$$

follows for all $x, y \in [a,b]$. Therefore, every function
$u \in H^1[a,b]$ belongs to $C[a,b]$, or more precisely, it coincides almost
everywhere with a continuous function. □


By Theorem 11.21 we may consider $H^1[a,b]$ as a subspace of $C[a,b]$.
Choose $y \in [a,b]$ such that $|u(y)| = \min_{a \le x \le b} |u(x)|$. Then
from

$$(b - a) \min_{a \le x \le b} |u(x)|^2 \le \int_a^b |u(x)|^2\,dx$$

and (11.45), by the Cauchy–Schwarz inequality we find that

$$\|u\|_\infty \le C\,\|u\|_{H^1}$$

for some constant $C$. The latter inequality means that the $H^1$ norm is
stronger than the maximum norm (in one space dimension!).


Theorem 11.22 The space

$$H_0^1[a,b] := \{u \in H^1[a,b] : u(a) = u(b) = 0\}$$

is a complete subspace of $H^1[a,b]$.


Proof. Since the $H^1$ norm is stronger than the maximum norm, each $H^1$
convergent sequence of elements of $H_0^1[a,b]$ has its limit in
$H_0^1[a,b]$. Therefore $H_0^1[a,b]$ is a closed subspace of $H^1[a,b]$, and
the statement follows from Remark 3.40. □


Definition 11.23 A function $u \in H_0^1[a,b]$ is called a weak solution to
the boundary value problem (11.36)–(11.37) if (11.38) is satisfied for all
$v \in H_0^1[a,b]$.


Theorem 11.24 Assume that $p > 0$ and $q \ge 0$. Then there exists a unique
weak solution to the boundary value problem (11.36)–(11.37).


Proof. The sesquilinear function
$S : H_0^1[a,b] \times H_0^1[a,b] \to \mathbb{R}$ is bounded, since

$$|S(u,v)| \le \max\{\|p\|_\infty, \|q\|_\infty\}\,\|u\|_{H^1}\|v\|_{H^1}$$

by the Cauchy–Schwarz inequality. For $u \in H_0^1[a,b]$, from (11.45) and the
Cauchy–Schwarz inequality we obtain that

$$\|u\|_{L^2}^2 = \int_a^b \left| \int_a^x u'(\xi)\,d\xi \right|^2 dx \le (b - a)^2\,\|u'\|_{L^2}^2.$$

Hence we can estimate

$$S(u,u) \ge \min_{a \le x \le b} p(x) \int_a^b |u'|^2\,dx \ge c\,\|u\|_{H^1}^2$$

for all $u \in H_0^1[a,b]$ and some positive constant $c$; i.e., $S$ is
strictly coercive. Finally, by the Cauchy–Schwarz inequality we have

$$|F(v)| \le \|r\|_{L^2}\|v\|_{L^2} \le \|r\|_{L^2}\|v\|_{H^1};$$

i.e., the linear function $F : H_0^1[a,b] \to \mathbb{R}$ is bounded. Now the
statement of the theorem follows from Corollary 11.16. □


We note that from (11.28) and the previous inequality it follows that

$$\|u\|_{H^1} \le \frac{1}{c}\,\|r\|_{L^2} \qquad (11.46)$$

for the weak solution $u$ to the boundary value problem (11.36)–(11.37).


Theorem 11.25 Each weak solution to the boundary value problem
(11.36)–(11.37) is also a classical solution; i.e., it is twice continuously
differentiable.


Proof. Define

$$f(x) := \int_a^x [q(\xi)u(\xi) - r(\xi)]\,d\xi, \quad x \in [a,b].$$

Then $f \in C^1[a,b]$. From (11.38), by partial integration we obtain

$$\int_a^b [pu' - f]\,v'\,dx = 0$$

for all $v \in H_0^1[a,b]$. Now we set

$$c := \frac{1}{b - a} \int_a^b (pu' - f)\,d\xi$$

and

$$v_0(x) := \int_a^x \{p(\xi)u'(\xi) - f(\xi) - c\}\,d\xi, \quad x \in [a,b].$$

Then $v_0 \in H_0^1[a,b]$ and

$$\int_a^b [pu' - f - c]^2\,dx = \int_a^b [pu' - f - c]\,v_0'\,dx = \int_a^b [pu' - f]\,v_0'\,dx - c \int_a^b v_0'\,dx = 0.$$

Hence

$$pu' = f + c,$$

and since $f$ and $p$ are in $C^1[a,b]$ with $p(x) > 0$ for all
$x \in [a,b]$, we can conclude that $u' \in C^1[a,b]$ and

$$(pu')' = f' = qu - r.$$

This completes the proof. □


Using the differential equation (11.36), from (11.46) we conclude that
there exists a constant $C > 0$, independent of $r$, such that

$$\|u''\|_{L^2} \le C\,\|r\|_{L^2}, \qquad (11.47)$$

which we note for later use.

As compared to Theorem 11.4 we have not obtained any major extension
of the existence result. However, as pointed out already in the introduction,
we view this section as a model case for the more complicated situation of
partial differential equations. By partial integration it can be seen that
the boundary value problem (11.17)–(11.18) for the Laplace operator is
equivalent to finding a function $u \in C^2(D)$ satisfying $u = 0$ on
$\partial D$ and

$$\int_D (\operatorname{grad} u \cdot \operatorname{grad} v + quv)\,dx = \int_D fv\,dx \qquad (11.48)$$

for all $v \in C^1(D)$ with $v = 0$ on $\partial D$. The analysis of weak
solutions of (11.48) follows the same pattern as for the ordinary
differential equation (11.36). However, the details are more involved. In
particular, since for the multidimensional case the Sobolev space $H^1(D)$ no
longer is a subspace of the continuous functions, the formulation of the
boundary condition, i.e., the definition of the subspace $H_0^1(D)$, has to be
modified, and establishing that weak solutions are also classical solutions
is more complicated. For a comprehensive study of weak solutions to boundary
value problems for elliptic partial differential equations we refer to
[24, 60].


11.5 The Finite Element Method 


The finite element method for the boundary value problem (11.36)–(11.37)
consists in the application of the Galerkin method (11.34) to the weak
formulation (11.38) by using spline spaces as approximating subspaces.
Then, for appropriate basis functions, the matrix $S(w_j, w_k)$ will be
sparse; i.e., most of the matrix entries will be zero. Polynomials as
approximating subspaces are not suitable, since analogously to Example 5.1,
they lead to ill-conditioned linear systems with full matrices.

We consider the case of linear splines. For the equidistant grid

$$x_j := a + jh, \quad j = 0, \ldots, n+1,$$

with step size $h = (b - a)/(n + 1)$ and $n \in \mathbb{N}$ we choose for
$X_n$ the space of continuous piecewise linear functions; i.e., $X_n$
consists of the functions $u \in C[a,b]$ that satisfy $u(a) = u(b) = 0$ and
coincide on each subinterval $[x_{j-1}, x_j]$ with a polynomial in $P_1$ for
$j = 1, \ldots, n+1$. The functions in this spline space belong to
$H_0^1[a,b]$ with piecewise constant weak derivatives. As basis elements in
$X_n$ we take the so-called hat functions

$$w_k(x) := \begin{cases} \dfrac{1}{h}(x - x_{k-1}), & x \in [x_{k-1}, x_k], \\ \dfrac{1}{h}(x_{k+1} - x), & x \in (x_k, x_{k+1}], \\ 0, & x \notin [x_{k-1}, x_{k+1}]. \end{cases}$$


Each $u \in X_n$ can be represented in the form

$$u = \sum_{k=1}^{n} \alpha_k w_k,$$

where $\alpha_k = u(x_k)$, $k = 1, \ldots, n$. Obviously, we have

$$S(w_j, w_k) = \int_a^b \{p w_j' w_k' + q w_j w_k\}\,dx = 0$$

if $(x_{j-1}, x_{j+1}) \cap (x_{k-1}, x_{k+1}) = \emptyset$, i.e., if
$|j - k| \ge 2$. Therefore, the matrix $S(w_j, w_k)$ is tridiagonal. We
compute the matrix elements

$$S(w_j, w_j) = \frac{1}{h^2} \int_{x_{j-1}}^{x_{j+1}} p(x)\,dx + \frac{1}{h^2} \left\{ \int_{x_{j-1}}^{x_j} q(x)(x - x_{j-1})^2\,dx + \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)^2\,dx \right\}$$

and

$$S(w_j, w_{j+1}) = S(w_{j+1}, w_j) = -\frac{1}{h^2} \int_{x_j}^{x_{j+1}} p(x)\,dx + \frac{1}{h^2} \int_{x_j}^{x_{j+1}} q(x)(x_{j+1} - x)(x - x_j)\,dx,$$

and the right-hand sides

$$F(w_j) = \frac{1}{h} \left\{ \int_{x_{j-1}}^{x_j} r(x)(x - x_{j-1})\,dx + \int_{x_j}^{x_{j+1}} r(x)(x_{j+1} - x)\,dx \right\}.$$

These equations illustrate two general features of finite element methods.
Firstly, it is characteristic for the finite element method that the
coefficients are computed by the same formula for each subinterval, i.e., for
each of the finite elements into which the total interval is subdivided.

Secondly, as already mentioned earlier, the Galerkin method is only
semidiscrete. In order to make it fully discrete, a numerical quadrature
has to be applied. If we remain within our framework of approximations
and approximate $p$, $q$, and $r$ by linear splines, we obtain

$$S(w_j, w_j) \approx \frac{1}{2h}(p_{j-1} + 2p_j + p_{j+1}) + \frac{h}{12}(q_{j-1} + 6q_j + q_{j+1})$$

and

$$S(w_j, w_{j+1}) \approx -\frac{1}{2h}(p_j + p_{j+1}) + \frac{h}{12}(q_j + q_{j+1})$$

for the matrix elements, and

$$F(w_j) \approx \frac{h}{6}(r_{j-1} + 4r_j + r_{j+1})$$

for the right-hand sides. Here, as above, we have set $p_j = p(x_j)$,
$q_j = q(x_j)$, and $r_j = r(x_j)$. Similar to the linear system
(11.10)–(11.11) for the finite difference method, the tridiagonal linear
system is irreducible and weakly row-diagonally dominant. It also is
accessible to convergence acceleration of the Jacobi iterations by relaxation
and multigrid methods.
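
The fully discrete scheme just described can be sketched in a few lines of
Python. The sketch below is only an illustration under stated assumptions:
it uses the three approximate formulas above for the matrix entries and
right-hand sides, solves the tridiagonal system by plain Gaussian
elimination without pivoting (instead of the iterative methods mentioned
in the text), and the function names `thomas`, `fem_linear_splines`, and
`max_error` as well as the two test problems are hypothetical choices.

```python
import math


def thomas(lower, diag, upper, rhs):
    """Solve a tridiagonal system; lower[i] multiplies x[i-1], upper[i] x[i+1]."""
    n = len(diag)
    d, b = diag[:], rhs[:]
    for i in range(1, n):
        m = lower[i] / d[i - 1]
        d[i] -= m * upper[i - 1]
        b[i] -= m * b[i - 1]
    x = [0.0] * n
    x[-1] = b[-1] / d[-1]
    for i in range(n - 2, -1, -1):
        x[i] = (b[i] - upper[i] * x[i + 1]) / d[i]
    return x


def fem_linear_splines(p, q, r, a, b, n):
    """Return interior grid points and coefficients alpha_k = u_n(x_k)."""
    h = (b - a) / (n + 1)
    x = [a + j * h for j in range(n + 2)]
    P, Q, R = [p(t) for t in x], [q(t) for t in x], [r(t) for t in x]
    # S(w_j, w_j), j = 1..n
    diag = [(P[j - 1] + 2 * P[j] + P[j + 1]) / (2 * h)
            + h * (Q[j - 1] + 6 * Q[j] + Q[j + 1]) / 12 for j in range(1, n + 1)]
    # S(w_j, w_{j+1}), j = 1..n-1
    off = [-(P[j] + P[j + 1]) / (2 * h) + h * (Q[j] + Q[j + 1]) / 12
           for j in range(1, n)]
    # F(w_j), j = 1..n
    rhs = [h * (R[j - 1] + 4 * R[j] + R[j + 1]) / 6 for j in range(1, n + 1)]
    alpha = thomas([0.0] + off, diag, off + [0.0], rhs)
    return x[1:-1], alpha


# Test problem 1 (assumed): -u'' = 1 on [0, 1], exact solution x(1 - x)/2.
nodes, alpha = fem_linear_splines(lambda x: 1.0, lambda x: 0.0,
                                  lambda x: 1.0, 0.0, 1.0, 9)
err_poisson = max(abs(c - 0.5 * t * (1 - t)) for t, c in zip(nodes, alpha))


def max_error(n):
    """Test problem 2 (assumed): -u'' + u = (pi^2 + 1) sin(pi x), u = sin(pi x)."""
    pts, coef = fem_linear_splines(
        lambda x: 1.0, lambda x: 1.0,
        lambda x: (math.pi ** 2 + 1) * math.sin(math.pi * x), 0.0, 1.0, n)
    return max(abs(c - math.sin(math.pi * t)) for t, c in zip(pts, coef))


err_coarse, err_fine = max_error(49), max_error(99)
print(err_poisson, err_coarse, err_fine, err_coarse / err_fine)
```

Halving the step size from $h = 1/50$ to $h = 1/100$ reduces the nodal
error in the second test problem by a factor close to four, in line with
the second-order behavior that is discussed for the $L^2$ norm below.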

In order to derive an error estimate for the semidiscrete version of the
finite element method with linear splines from Theorem 11.17, we need an
estimate for the interpolation error for linear splines with respect to the
$H^1$ norm (see also Theorem 8.33).


Lemma 11.26 Let $f \in C^2[a,b]$. Then the remainder $R_1 f := f - L_1 f$
for the linear interpolation at the two endpoints $a$ and $b$ can be
estimated by

$$\|R_1 f\|_{L^2} \le (b - a)^2\,\|f''\|_{L^2}, \qquad \|(R_1 f)'\|_{L^2} \le (b - a)\,\|f''\|_{L^2}. \qquad (11.49)$$

Proof. For each function $g \in C^1[a,b]$ satisfying $g(a) = 0$, from

$$g(x) = \int_a^x g'(\xi)\,d\xi,$$

by using the Cauchy–Schwarz inequality we obtain

$$|g(x)|^2 \le (b - a)\,\|g'\|_{L^2}^2, \quad x \in [a,b].$$

From this, by integration we derive the Friedrichs inequality

$$\|g\|_{L^2} \le (b - a)\,\|g'\|_{L^2} \qquad (11.50)$$

for functions $g \in C^1[a,b]$ with $g(a) = 0$ (or $g(b) = 0$). Using the
interpolation property $(R_1 f)(a) = (R_1 f)(b) = 0$ and $(L_1 f)'' = 0$, by
partial integration we obtain

$$\int_a^b [f' - (L_1 f)']^2\,dx = \int_a^b f''\,(L_1 f - f)\,dx.$$

From this, again applying the Cauchy–Schwarz inequality, we have

$$\|(R_1 f)'\|_{L^2}^2 \le \|f''\|_{L^2}\,\|R_1 f\|_{L^2},$$

whence (11.49) follows with the aid of the Friedrichs inequality (11.50) for
$g = R_1 f$. □


Theorem 11.27 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)–(11.37) can be estimated by

$$\|u_n - u\|_{H^1} \le C\,\|u''\|_{L^2}\,h \qquad (11.51)$$

for some positive constant $C$.


Proof. By summing up the inequalities (11.49), applied to each of the
subintervals of length $h$, for the interpolating linear spline
$w_n \in X_n$ with $w_n(x_j) = u(x_j)$ for $j = 0, \ldots, n+1$ we find that

$$\|w_n' - u'\|_{L^2} \le \|u''\|_{L^2}\,h$$

and

$$\|w_n - u\|_{L^2} \le \|u''\|_{L^2}\,h^2,$$

whence

$$\inf_{v \in X_n} \|v - u\|_{H^1} \le \|w_n - u\|_{H^1} \le (1 + b - a)\,\|u''\|_{L^2}\,h$$

follows. Now (11.51) is a consequence of the error estimate for the Galerkin
method of Theorem 11.17. □


By the following trick, which was independently developed by Aubin
(1967) and Nitsche (1968), we can improve the error estimate in the $L^2$
norm to the order $O(h^2)$ that we expect for approximations using linear
splines.

Theorem 11.28 The error in the finite element approximation by linear
splines for the boundary value problem (11.36)–(11.37) can be estimated by

$$\|u_n - u\|_{L^2} \le C\,\|u''\|_{L^2}\,h^2$$

with some positive constant $C$.


Proof. Denote by $z_n$ the weak solution to the boundary value problem with
the right-hand side $u - u_n$; i.e.,

$$S(v, z_n) = (v, u - u_n)_{L^2}$$

for all $v \in H_0^1[a,b]$. In particular, inserting $v = u - u_n$, it
follows that

$$S(u - u_n, z_n) = \|u - u_n\|_{L^2}^2. \qquad (11.52)$$

Since $S(v,u) = F(v)$ and $S(v,u_n) = F(v)$ for all $v \in X_n$, using the
symmetry of $S$ we have

$$S(u - u_n, v) = 0$$

for all $v \in X_n$. Inserting the Galerkin approximation to $z_n$, which we
denote by $\tilde{z}_n$, into the last equation and subtracting from (11.52),
we obtain

$$\|u - u_n\|_{L^2}^2 = S(u - u_n, z_n - \tilde{z}_n). \qquad (11.53)$$

Since $S$ is bounded, from (11.53) and (11.51), applied to $u - u_n$ and
$z_n - \tilde{z}_n$, we can conclude that

$$\|u - u_n\|_{L^2}^2 \le C_1\,\|u''\|_{L^2}\,\|z_n''\|_{L^2}\,h^2$$

for some constant $C_1$. However, from (11.47) we also have that

$$\|z_n''\|_{L^2} \le C_2\,\|u - u_n\|_{L^2}$$

for some constant $C_2$. Now the assertion of the theorem follows from the
last two inequalities. □


We refrain from describing both the extension of this analysis to higher- 
order splines such as cubic splines (see Problem 11.19) and the extension 
to partial differential equations. For the latter we refer to [4, 11]. 


Problems 


11.1 Consider multiple shooting for the boundary value problem

$$u'' + u = 0, \quad u(a) = u(b) = 0,$$

with $n$ equidistant subintervals. Show that the corresponding linear system
(11.6) is uniquely solvable, provided that $(b - a)/\pi \notin \mathbb{N}$.


11.2 Write a computer program for multiple shooting using the Newton method
and the Runge–Kutta method and test it for various examples.


11.3 Show that the boundary value problem for the differential equation

$$u'' = f(x, u, u'), \quad a \le x \le b,$$

with inhomogeneous boundary conditions $u(a) = \alpha$ and $u(b) = \beta$ can
be equivalently transformed into a boundary value problem with homogeneous
boundary condition.


11.4 Show that the general solution of the linear differential equation (11.7) is
given by (11.9).


11.5 For $p \in C^1[a,b]$ and $q \in C[a,b]$ show that the boundary value
problem

$$u'' + pu' + qu = r \quad \text{in } [a,b], \quad u(a) = u(b) = 0,$$

is solvable for each right-hand side $r \in C[a,b]$ if and only if the
boundary value problem

$$u'' + pu' + qu = 0 \quad \text{in } [a,b], \quad u(a) = u(b) = 0,$$

admits only the trivial solution $u = 0$.

11.6 Find the solution of the boundary value problem

$$u''(x) + u(x) = e^x, \quad u(0) = u(1) = 0.$$


11.7 Write a computer program for the finite difference method (11.10)—(11.11) 
and test it for various examples. 


11.8 Find the explicit solution for the finite difference approximation (11.10)- 
(11.11) for the boundary value problem 

vu’ —-u=-—2 in [0,1], u(0) = u(1) =0, 
and verify the convergence result of Theorem 11.8. 


11.9 Show that the error in the finite difference approximation (11.15) is of
order $O(h^2)$ and that the error in the approximation (11.16) is of order
$O(h^4)$.


11.10 Prove the estimate $\|u_0\|_\infty \le 1/8$ for the solution to the
boundary value problem (11.21).


11.11 In the space $C[a,b]$ with scalar product

$$(u,v) := \int_a^b u(x)\overline{v(x)}\,dx,$$

define a functional $F : C[a,b] \to \mathbb{C}$ by

$$F(u) := \int_a^b u(x)\,dx.$$

Show that $F$ is linear and bounded. Is there an $f \in C[a,b]$ such that
$F(u) = (u, f)$ for all $u \in C[a,b]$? Does your answer agree with the Riesz
Theorem 11.11?


11.12 In the pre-Hilbert space of Problem 11.11, for a fixed $x \in [a,b]$
consider the point evaluation functional $F : C[a,b] \to \mathbb{C}$ defined
by

$$F(u) := u(x).$$

Is $F$ linear and bounded?


11.13 Let $X$ and $Y$ be Hilbert spaces and let $A : X \to Y$ be a bounded
linear operator. Show that there exists a uniquely determined bounded linear
operator $A^* : Y \to X$ such that

$$(Au, v) = (u, A^*v)$$

for all $u \in X$ and $v \in Y$. The operator $A^*$ is called the adjoint
operator of $A$. Show that $\|A\| = \|A^*\|$.


11.14 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in
a Hilbert space $X$; i.e., $(Au,v) = (u,Av)$ for all $u, v \in X$ and
$(Au,u) > 0$ for all $u \ne 0$. Choose $w_0 \in X$ and define
$w_j = Aw_{j-1}$ for $j = 1, \ldots, n-1$. Show that the Galerkin equations
for $Au = f$ with respect to the subspaces
$X_n = \operatorname{span}\{w_0, \ldots, w_{n-1}\}$ are uniquely solvable for
each $n \in \mathbb{N}$. Moreover, if $f$ is in the closure of
$\operatorname{span}\{A^j w_0 : j = 0, 1, \ldots\}$, then the Galerkin
approximation $u_n$ converges to the solution of $Au = f$.

Show that in the special case $w_0 = f$ the approximations $u_n$ can be
computed iteratively by the formulae $u_0 = 0$, $p_0 = f$, and

$$u_{n+1} = u_n - \alpha_n p_n,$$
$$p_n = r_n + \beta_{n-1} p_{n-1},$$
$$r_n = r_{n-1} - \alpha_{n-1} A p_{n-1},$$
$$\alpha_{n-1} = (r_{n-1}, p_{n-1})/(Ap_{n-1}, p_{n-1}),$$
$$\beta_{n-1} = -(r_n, Ap_{n-1})/(Ap_{n-1}, p_{n-1}).$$

Here $r_n$ is the residual $r_n = Au_n - f$. This is the conjugate gradient
method of Hestenes and Stiefel.


11.15 Let $A : X \to X$ be a bounded, self-adjoint, and positive operator in
a Hilbert space $X$; i.e., $(Au,v) = (u,Av)$ for all $u, v \in X$ and
$(Au,u) > 0$ for all $u \ne 0$. Show that solving the equation $Au = f$ is
equivalent to minimizing the so-called energy functional

$$E(v) := (v, Av) - 2\operatorname{Re}(v, f)$$

on $X$. Show that the Galerkin approximation with respect to a subspace $X_n$
is equivalent to minimizing $E$ on $X_n$. This method is known as the
Rayleigh–Ritz method.


11.16 Show that under the assumptions of Problem 11.15 for the Galerkin 
equations the SOR method of Section 4.2 converges for 0 < w < 2. 


11.17 Show that the weak derivative, if it exists, is unique and that each
function with vanishing weak derivative must be constant almost everywhere.


11.18 Write a computer program for the finite element method with linear
splines and test it for various examples. Compare the numerical results with 
those for the finite difference method. 


11.19 Let B_,;, Bo, Bi,..., Bn, Bn+i1, Bn+2 denote the cubic B-splines for the 
equidistant grid 2; :=a+ jh, 7 =0,...,n+1, with step size h = (b—a)/n. Show 
that 

Uo i= Bo —_ 4B_,, Ui, w= By _ B_y, 


U2 = Bo,...,Un-1 = Bn-1, 


Un i= B, — Bn+e, Un+1 — B, — ABn+2 


is a basis for 5? M H4[a, b], i.e., for the space of cubic splines vanishing at the 
endpoints. 

Using this basis, set up the Galerkin equations for the Sturm—Liouville problem 
analogous to the case of linear splines treated in Section 11.5. 


11.20 Formulate and prove analogues of Theorems 11.27 and 11.28 for the finite 
element approximation using cubic splines as in Problem 11.19. 


12 


Integral Equations 


The topic of the last chapter of this book is linear integral equations, of
which

$$\int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

and

$$\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],$$

are typical examples. In these equations the function $\varphi$ is the
unknown, and the so-called kernel $K$ and the right-hand side $f$ are given
functions.
The above equations are called Fredholm integral equations of the first and 
second kind, respectively. Since both the theory and the numerical approx- 
imations for integral equations of the first kind are far more complicated 
than for integral equations of the second kind, we will confine our presen- 
tation to the latter case. 

Integral equations provide an important tool for solving boundary value 
problems for both ordinary and partial differential equations (see Problem 
12.1 and [39]). Their historical development is closely related to the solution 
of boundary value problems in potential theory in the last decades of the 
nineteenth century. Progress in the theory of integral equations also had a 
great impact on the development of functional analysis. 

Omitting the proofs, we will present the main results of the Riesz theory 
for compact operators as the foundation of the existence theory for integral 
equations of the second kind. Then we will develop the fundamental ideas 
of the Nyström method and the collocation method as the two most important
approaches for the numerical solution of these integral equations.
This is done in a general framework of operator equations and their ap- 
proximate solution, which makes the analysis more widely applicable. For 
a comprehensive study of both the theory and the numerical solution of 
linear integral equations we refer to [39]. 


12.1 The Riesz Theory 


This section is devoted to a summary of some of the basic facts of the theory 
of Fredholm integral equations of the second kind. The integral equations 
formulated above carry the name of Fredholm, since in 1902 Fredholm 
established an existence theory for integral equations of the second kind 
with continuous kernels, which is now known as the Fredholm alternative. 
For the purpose of this introduction to the numerical solution of integral 
equations it suffices to consider only the first and most important part of 
this alternative, which states that the inhomogeneous equation

$$\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b], \qquad (12.1)$$

with continuous kernel $K$ has a unique solution $\varphi \in C[a,b]$ for
each right-hand side $f \in C[a,b]$ if and only if the homogeneous integral
equation

$$\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = 0, \quad x \in [a,b], \qquad (12.2)$$

has only the trivial solution. The importance of this result originates from
the fact that it reduces the difficult problem of establishing existence of
a solution to the inhomogeneous integral equation to the simpler problem
of showing that the homogeneous integral equation allows only the trivial
solution $\varphi = 0$, and it extends the corresponding statement for systems
of linear equations to the case of integral equations. Actually, Fredholm 
derived his results by interpreting integral equations as a limiting case of 
linear systems by considering the integral as a limit of Riemann sums and 
passing to the limit in Cramer’s rule for the solution of linear systems. For 
the solution of integral equations with continuous kernels, Fredholm’s ap- 
proach is still the most elegant and shortest. However, since it is restricted 
to the case of continuous kernels, it is more convenient to consider the 
above equations as a special case of operator equations of the second kind 
with a compact operator, as presented by Riesz in 1918. 


Definition 12.1 A linear operator $A : X \to Y$ from a normed space $X$
into a normed space $Y$ is called compact if for each bounded sequence
$(\varphi_n)$ in $X$ the sequence $(A\varphi_n)$ contains a convergent
subsequence in $Y$, i.e., if each sequence from the set
$\{A\varphi : \varphi \in X, \|\varphi\| \le 1\}$ contains a convergent
subsequence.

Without developing the concept of compactness in normed spaces in any
detail, we note that this definition is equivalent to requiring that the set
$\{A\varphi : \varphi \in X, \|\varphi\| \le 1\}$ be relatively sequentially
compact.

Compact operators are bounded, linear combinations of compact operators are
compact, and products of two bounded operators are compact if one of them is
compact (see Problem 12.2). From the Bolzano–Weierstrass theorem it can be
seen that bounded operators $A : X \to X$ with finite-dimensional range
$A(X) := \{A\varphi : \varphi \in X\}$ are compact. Furthermore, the identity
operator $I : X \to X$, defined by $I : \varphi \mapsto \varphi$ for all
$\varphi \in X$, is compact if and only if the space $X$ is
finite-dimensional. This actually justifies the distinction between the
equations $A\varphi = f$ and $\varphi - A\varphi = f$ as equations of the
first and second kind, since $A$ and $I - A$ have different properties in
infinite-dimensional spaces if $A$ is compact. A proof of these facts and of
the following important theorem can be found in most introductory books
on functional analysis, for example in [39].

The fundamental result of the Riesz theory is described by the following 
theorem, which extends Fredholm’s result on the equivalence of injectivity 
and surjectivity to the case of operator equations of the second kind with 
a compact operator. 


Theorem 12.2 Let $A : X \to X$ be a compact operator in a normed space
$X$. Then $I - A$ is surjective if and only if it is injective. If the inverse
operator $(I - A)^{-1} : X \to X$ exists, it is bounded.


In order to verify that Fredholm's existence analysis for integral equations
with continuous kernels $K : [a,b] \times [a,b] \to \mathbb{R}$ can be viewed
as a special case of Theorem 12.2, we have to establish that the linear
integral operator $A : C[a,b] \to C[a,b]$, defined by

$$(A\varphi)(x) := \int_a^b K(x,y)\varphi(y)\,dy, \quad x \in [a,b], \qquad (12.3)$$

is compact. For this we need the following theorem due to Arzelà–Ascoli,
which again is proven in most introductions to functional analysis.


Theorem 12.3 (Arzelà–Ascoli) Each sequence from a subset
$U \subset C[a,b]$ contains a uniformly convergent subsequence; i.e., $U$ is
relatively sequentially compact, if and only if it is bounded and
equicontinuous, i.e., if there exists a constant $C$ such that

$$|\varphi(x)| \le C$$

for all $x \in [a,b]$ and all $\varphi \in U$, and for every
$\varepsilon > 0$ there exists $\delta > 0$ such that

$$|\varphi(x) - \varphi(y)| < \varepsilon$$

for all $x, y \in [a,b]$ with $|x - y| < \delta$ and all $\varphi \in U$.

Theorem 12.4 The integral operator (12.3) with continuous kernel is a
compact operator on $C[a,b]$.


Proof. For all $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \le 1$ and all
$x \in [a,b]$, we have that

$$|(A\varphi)(x)| \le (b - a) \max_{x,y \in [a,b]} |K(x,y)|;$$

i.e., the set
$U := \{A\varphi : \varphi \in C[a,b], \|\varphi\|_\infty \le 1\} \subset C[a,b]$
is bounded. Since $K$ is uniformly continuous on the square
$[a,b] \times [a,b]$, for every $\varepsilon > 0$ there exists $\delta > 0$
such that

$$|K(x,z) - K(y,z)| < \frac{\varepsilon}{b - a}$$

for all $x, y, z \in [a,b]$ with $|x - y| < \delta$. Then

$$|(A\varphi)(x) - (A\varphi)(y)| = \left| \int_a^b [K(x,z) - K(y,z)]\varphi(z)\,dz \right| < \varepsilon$$

for all $x, y \in [a,b]$ with $|x - y| < \delta$ and all
$\varphi \in C[a,b]$ with $\|\varphi\|_\infty \le 1$; i.e., $U$ is
equicontinuous. Hence $A$ is compact by the Arzelà–Ascoli Theorem 12.3. □


In our analysis we also will need an explicit expression for the norm of 
the integral operator A. 


Theorem 12.5 The norm of the integral operator $A : C[a,b] \to C[a,b]$
with continuous kernel $K$ is given by

$$\|A\|_\infty = \max_{a \le x \le b} \int_a^b |K(x,y)|\,dy. \qquad (12.4)$$


Proof. For each $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \le 1$ we have

$$|(A\varphi)(x)| \le \int_a^b |K(x,y)|\,dy, \quad x \in [a,b],$$

and thus

$$\|A\|_\infty = \sup_{\|\varphi\|_\infty \le 1} \|A\varphi\|_\infty \le \max_{a \le x \le b} \int_a^b |K(x,y)|\,dy.$$

Since $K$ is continuous, there exists $x_0 \in [a,b]$ such that

$$\int_a^b |K(x_0,y)|\,dy = \max_{a \le x \le b} \int_a^b |K(x,y)|\,dy.$$

For $\varepsilon > 0$ choose $\psi \in C[a,b]$ by setting

$$\psi(y) := \frac{K(x_0,y)}{|K(x_0,y)| + \varepsilon}, \quad y \in [a,b].$$

Then $\|\psi\|_\infty \le 1$ and

$$\|A\psi\|_\infty \ge |(A\psi)(x_0)| = \int_a^b \frac{[K(x_0,y)]^2}{|K(x_0,y)| + \varepsilon}\,dy \ge \int_a^b \frac{[K(x_0,y)]^2 - \varepsilon^2}{|K(x_0,y)| + \varepsilon}\,dy = \int_a^b |K(x_0,y)|\,dy - \varepsilon(b - a).$$

Hence

$$\|A\|_\infty = \sup_{\|\varphi\|_\infty \le 1} \|A\varphi\|_\infty \ge \|A\psi\|_\infty \ge \int_a^b |K(x_0,y)|\,dy - \varepsilon(b - a),$$

and since this holds for all $\varepsilon > 0$, we have

$$\|A\|_\infty \ge \int_a^b |K(x_0,y)|\,dy = \max_{a \le x \le b} \int_a^b |K(x,y)|\,dy.$$

This concludes the proof. □
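
Formula (12.4) lends itself to a direct numerical check. The following
Python sketch approximates the maximum over $x$ by a maximum over grid
points and the integral over $y$ by a composite midpoint rule. The kernel
$K(x,y) = x - y$ on $[0,1]$, the grid sizes, and the function name
`operator_norm_inf` are assumptions chosen for illustration; for this
kernel $\int_0^1 |x - y|\,dy = x^2/2 + (1 - x)^2/2$, so that (12.4) gives
the value $1/2$, attained at $x = 0$ and $x = 1$.

```python
def operator_norm_inf(K, a, b, n_x=100, n_y=200):
    """Approximate max over x of the integral of |K(x, y)| over [a, b]."""
    h = (b - a) / n_y
    best = 0.0
    for i in range(n_x + 1):
        x = a + i * (b - a) / n_x
        # composite midpoint rule in y for the integral of |K(x, y)|
        integral = h * sum(abs(K(x, a + (k + 0.5) * h)) for k in range(n_y))
        best = max(best, integral)
    return best


norm = operator_norm_inf(lambda x, y: x - y, 0.0, 1.0)
print(norm)
```

The absolute value inside the integral matters: for the same kernel the
integral of $K(x,y)$ itself over $y$ vanishes at $x = 1/2$ and never
exceeds $1/2$ in modulus only because of cancellation-free endpoints.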


It also can be shown that the integral operator remains compact if the
kernel $K$ is merely weakly singular (see [39]). A kernel $K$ is said to be
weakly singular if it is defined and continuous for all $x, y \in [a,b]$,
$x \ne y$, and there exist positive constants $M$ and $\alpha \in (0,1]$ such
that

$$|K(x,y)| \le M\,|x - y|^{\alpha - 1}$$

for all $x, y \in [a,b]$, $x \ne y$.


12.2 Operator Approximations


The fundamental concept for approximately solving an operator equation

$$\varphi - A\varphi = f$$

of the second kind is to replace it by an equation

$$\varphi_n - A_n\varphi_n = f_n$$

with approximating sequences $A_n \to A$ and $f_n \to f$ as $n \to \infty$.
For computational purposes, the approximating equations will be chosen such
that they can be reduced to solving a system of linear equations. In this
section we will provide a convergence and error analysis for such
approximation schemes. In particular, we will derive convergence results and
error estimates for the cases where we have either norm or pointwise
convergence of the sequence $A_n \to A$, $n \to \infty$.


Theorem 12.6 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective. Assume that the sequence
$A_n : X \to X$ of bounded linear operators is norm convergent, i.e.,
$\|A_n - A\| \to 0$, $n \to \infty$. Then for sufficiently large $n$ the
inverse operators $(I - A_n)^{-1} : X \to X$ exist and are uniformly
bounded. For the solutions of the equations

\[
\varphi - A\varphi = f \quad\text{and}\quad \varphi_n - A_n\varphi_n = f_n
\]

we have an error estimate

\[
\|\varphi_n - \varphi\| \le C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.5}
\]

for some constant $C$.


Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$
exists and is bounded. Since $\|A_n - A\| \to 0$, $n \to \infty$, by Remark
3.25 we have $\|(I - A)^{-1}(A_n - A)\| \le q < 1$ for sufficiently large
$n$. For these $n$, by the Neumann series Theorem 3.48, the inverse
operators of

\[
I - (I - A)^{-1}(A_n - A) = (I - A)^{-1}(I - A_n)
\]

exist and are uniformly bounded by

\[
\| [I - (I - A)^{-1}(A_n - A)]^{-1} \| \le \frac{1}{1-q}.
\]

But then $[I - (I - A)^{-1}(A_n - A)]^{-1}(I - A)^{-1}$ are the inverse
operators of $I - A_n$, and they are uniformly bounded.
The error estimate follows from

\[
(I - A_n)(\varphi_n - \varphi) = (A_n - A)\varphi + f_n - f
\]

by the uniform boundedness of the inverse operators $(I - A_n)^{-1}$. □
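The Neumann series bound used in this proof can be checked numerically in a finite-dimensional setting. The following is only a sketch: the random matrix `B` stands in for an operator with norm $q < 1$ (in the proof, $(I - A)^{-1}(A_n - A)$), and the size `n`, the seed, and the value `q = 0.5` are arbitrary choices that do not come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6                                  # illustrative dimension
q = 0.5                                # target norm, must satisfy q < 1

B = rng.standard_normal((n, n))
B *= q / np.linalg.norm(B, 2)          # rescale so that ||B||_2 = q

inv = np.linalg.inv(np.eye(n) - B)

# Partial sums of the Neumann series I + B + B^2 + ...
S = np.eye(n)
P = np.eye(n)
for _ in range(60):
    P = P @ B
    S += P

neumann_error = np.linalg.norm(S - inv, 2)   # distance of partial sum to the inverse
inv_norm = np.linalg.norm(inv, 2)            # ||(I - B)^{-1}||_2
bound = 1.0 / (1.0 - q)                      # bound from the Neumann series
```

Under these assumptions `inv_norm` stays below `bound`, and the partial sums agree with the inverse up to rounding.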


In order to develop a similar analysis for the case where the sequence
$(A_n)$ is merely pointwise convergent, i.e., $A_n\varphi \to A\varphi$,
$n \to \infty$, for all $\varphi$, we will have to bridge the gap between
norm and pointwise convergence. This goal will be achieved through the
concept of collectively compact operator sequences and the following
uniform boundedness principle.


Theorem 12.7 Let the sequence $A_n : X \to Y$ of bounded linear operators
mapping a Banach space $X$ into a normed space $Y$ be pointwise bounded;
i.e., for each $\varphi \in X$ there exists a positive number $C_\varphi$
depending on $\varphi$ such that $\|A_n\varphi\| \le C_\varphi$ for all
$n \in \mathbb{N}$. Then the sequence $(A_n)$ is uniformly bounded; i.e.,
there exists some constant $C$ such that $\|A_n\| \le C$ for all
$n \in \mathbb{N}$.


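Proof. In the first step, by an indirect proof we establish that positive
constants $M$ and $\rho$ and an element $\psi \in X$ can be chosen such that

\[
\|A_n\varphi\| \le M \tag{12.6}
\]

for all $\varphi \in X$ with $\|\varphi - \psi\| \le \rho$ and all
$n \in \mathbb{N}$. Assume that this is not possible. Then, by induction, we
construct sequences $(n_k)$ in $\mathbb{N}$, $(\rho_k)$ in $\mathbb{R}$, and
$(\varphi_k)$ in $X$ such that

\[
\|A_{n_k}\varphi\| \ge k
\]

for $k = 0, 1, 2, \ldots$ and all $\varphi$ with
$\|\varphi - \varphi_k\| \le \rho_k$ and

\[
0 < \rho_k \le \tfrac{1}{2}\rho_{k-1}, \qquad
\|\varphi_k - \varphi_{k-1}\| \le \tfrac{1}{2}\rho_{k-1}
\]

for $k = 1, 2, \ldots$.

We initiate the induction by setting $n_0 = 1$, $\rho_0 = 1$, and
$\varphi_0 = 0$. Assume that $n_k \in \mathbb{N}$, $\rho_k > 0$, and
$\varphi_k \in X$ are given. Then there exist $n_{k+1} \in \mathbb{N}$ and
$\varphi_{k+1} \in X$ satisfying $\|\varphi_{k+1} - \varphi_k\| \le \rho_k/2$
and $\|A_{n_{k+1}}\varphi_{k+1}\| \ge k + 2$. Otherwise, we would have
$\|A_n\varphi\| \le k + 2$ for all $\varphi \in X$ with
$\|\varphi - \varphi_k\| \le \rho_k/2$ and all $n \in \mathbb{N}$, and this
contradicts our assumption. Set

\[
\rho_{k+1} := \min\left\{ \frac{\rho_k}{2}, \frac{1}{\|A_{n_{k+1}}\|} \right\}
\le \frac{\rho_k}{2}.
\]

Then for all $\varphi \in X$ with $\|\varphi - \varphi_{k+1}\| \le \rho_{k+1}$,
by the triangle inequality we have

\[
\|A_{n_{k+1}}\varphi\| \ge \|A_{n_{k+1}}\varphi_{k+1}\|
- \|A_{n_{k+1}}(\varphi - \varphi_{k+1})\| \ge k + 1,
\]

since $\|A_{n_{k+1}}(\varphi - \varphi_{k+1})\| \le \|A_{n_{k+1}}\|\,\rho_{k+1} \le 1$.
For $j > k$, using the geometric series we have

\[
\|\varphi_k - \varphi_j\|
\le \|\varphi_k - \varphi_{k+1}\| + \cdots + \|\varphi_{j-1} - \varphi_j\|
\le \tfrac{1}{2}\rho_k + \cdots + \tfrac{1}{2}\rho_{j-1}
\le \tfrac{1}{2}\rho_k + \tfrac{1}{4}\rho_k + \cdots + \tfrac{1}{2^{j-k}}\rho_k
\le \rho_k.
\]

Therefore, $(\varphi_k)$ is a Cauchy sequence and converges to some element
$\varphi$ in the Banach space $X$. From $\|\varphi_k - \varphi_j\| \le \rho_k$
for all $j > k$ by passing to the limit $j \to \infty$ we see that
$\|\varphi_k - \varphi\| \le \rho_k$ for all $k \in \mathbb{N}$. Therefore,
we have $\|A_{n_k}\varphi\| \ge k$ for all $k \in \mathbb{N}$, which is a
contradiction to the boundedness of the sequence $(A_n\varphi)$.

Now, in the second step, from the validity of (12.6) we deduce for each
$\varphi \in X$ with $\|\varphi\| \le 1$ and for all $n \in \mathbb{N}$ the
estimate

\[
\|A_n\varphi\| = \frac{1}{\rho}\,\|A_n(\rho\varphi + \psi) - A_n\psi\|
\le \frac{2M}{\rho}.
\]

This completes the proof. □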


Following Anselone [2], we introduce the concept of collectively compact 
operator sequences. 


Definition 12.8 A sequence $A_n : X \to Y$ of linear operators from a
normed space $X$ into a normed space $Y$ is called collectively compact if
each sequence from the set
$\{A_n\varphi : \varphi \in X,\ \|\varphi\| \le 1,\ n \in \mathbb{N}\}$
contains a convergent subsequence.


Each operator $A_n$ from a collectively compact sequence is compact.


Lemma 12.9 Let $X$ be a Banach space, let $A_n : X \to X$ be a collectively
compact sequence, and let $B_n : X \to X$ be a pointwise convergent sequence
with limit operator $B : X \to X$. Then

\[
\|(B_n - B)A_n\| \to 0, \quad n \to \infty. \tag{12.7}
\]


Proof. Assume that (12.7) is not valid. Then there exist
$\varepsilon_0 > 0$, a sequence $(n_k)$ in $\mathbb{N}$ with
$n_k \to \infty$, $k \to \infty$, and a sequence $(\varphi_k)$ in $X$ with
$\|\varphi_k\| \le 1$ such that

\[
\|(B_{n_k} - B)A_{n_k}\varphi_k\| \ge \varepsilon_0, \quad k = 1, 2, \ldots. \tag{12.8}
\]

Since the sequence $(A_n)$ is collectively compact, there exists a
subsequence such that

\[
A_{n_{k(j)}}\varphi_{k(j)} \to \psi \in X, \quad j \to \infty. \tag{12.9}
\]

Then we can estimate with the aid of the triangle inequality and Remark
3.29 to obtain

\[
\|(B_{n_{k(j)}} - B)A_{n_{k(j)}}\varphi_{k(j)}\|
\le \|(B_{n_{k(j)}} - B)\psi\|
+ \|B_{n_{k(j)}} - B\|\,\|A_{n_{k(j)}}\varphi_{k(j)} - \psi\|. \tag{12.10}
\]

The first term on the right-hand side of (12.10) tends to zero as
$j \to \infty$, since the operator sequence $(B_n)$ is pointwise convergent.
The second term tends to zero as $j \to \infty$, since the operator sequence
$(B_n)$ is uniformly bounded by Theorem 12.7 and since we have the
convergence (12.9). Therefore, passing to the limit $j \to \infty$ in
(12.10) yields a contradiction to (12.8), and the proof is complete. □




Theorem 12.10 Let $A : X \to X$ be a compact linear operator on a Banach
space $X$ such that $I - A$ is injective, and assume that the sequence
$A_n : X \to X$ of linear operators is collectively compact and pointwise
convergent; i.e., $A_n\varphi \to A\varphi$, $n \to \infty$, for all
$\varphi \in X$. Then for sufficiently large $n$ the inverse operators
$(I - A_n)^{-1} : X \to X$ exist and are uniformly bounded. For the
solutions of the equations

\[
\varphi - A\varphi = f \quad\text{and}\quad \varphi_n - A_n\varphi_n = f_n
\]

we have an error estimate

\[
\|\varphi_n - \varphi\| \le C \{ \|(A_n - A)\varphi\| + \|f_n - f\| \} \tag{12.11}
\]

for some constant $C$.


Proof. By the Riesz Theorem 12.2, the inverse $(I - A)^{-1} : X \to X$
exists and is bounded. The identity

\[
(I - A)^{-1} = I + (I - A)^{-1}A
\]

suggests

\[
M_n := I + (I - A)^{-1}A_n
\]

as an approximate inverse for $I - A_n$. Elementary calculations yield

\[
M_n(I - A_n) = I - S_n, \tag{12.12}
\]

where

\[
S_n := (I - A)^{-1}(A_n - A)A_n.
\]

From Lemma 12.9 we conclude that $\|S_n\| \to 0$, $n \to \infty$. Hence for
sufficiently large $n$ we have $\|S_n\| \le q < 1$. For these $n$, by the
Neumann series Theorem 3.48, the inverse operators $(I - S_n)^{-1}$ exist
and are uniformly bounded by

\[
\|(I - S_n)^{-1}\| \le \frac{1}{1-q}.
\]

Now (12.12) implies first that $I - A_n$ is injective, and therefore, since
$A_n$ is compact, by Theorem 12.1 the inverse $(I - A_n)^{-1}$ exists. Then
(12.12) also yields $(I - A_n)^{-1} = (I - S_n)^{-1}M_n$, whence uniform
boundedness follows, since the operators $M_n$ are uniformly bounded by
Theorem 12.7. The error estimate (12.11) is proven as in Theorem 12.6. □


Note that both error estimates (12.5) and (12.11) show that the accuracy
of the approximate solution essentially depends on how well $A_n\varphi$
approximates $A\varphi$ for the exact solution $\varphi$.


12.3 Nyström's Method


Recalling Chapter 9, we choose a convergent sequence

\[
Q_n(g) := \sum_{k=0}^{n} a_k^{(n)} g(x_k^{(n)})
\]

of quadrature formulae for the integral

\[
Q(g) := \int_a^b g(x)\,dx
\]

with quadrature points $x_0^{(n)}, \ldots, x_n^{(n)} \in [a,b]$ and real
quadrature weights $a_0^{(n)}, \ldots, a_n^{(n)}$. For convenience we write
$x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$, and
$a_0, \ldots, a_n$ instead of $a_0^{(n)}, \ldots, a_n^{(n)}$. We approximate
the integral operator

\[
(A\varphi)(x) := \int_a^b K(x,y)\varphi(y)\,dy, \quad x \in [a,b],
\]

with continuous kernel $K$ by a sequence of numerical integration operators

\[
(A_n\varphi)(x) := \sum_{k=0}^{n} a_k K(x,x_k)\varphi(x_k), \quad x \in [a,b];
\]

i.e., we apply the quadrature formulae for $g = K(x,\cdot)\varphi$. Then the
solution to the integral equation of the second kind

\[
\varphi - A\varphi = f
\]

is approximated by the solution of

\[
\varphi_n - A_n\varphi_n = f,
\]

which reduces to solving a finite-dimensional linear system.
Theorem 12.11 Let $\varphi_n$ be a solution of

\[
\varphi_n(x) - \sum_{k=0}^{n} a_k K(x,x_k)\varphi_n(x_k) = f(x), \quad x \in [a,b]. \tag{12.13}
\]

Then the values $\varphi_j^{(n)} = \varphi_n(x_j)$, $j = 0, \ldots, n$, at the
quadrature points satisfy the linear system

\[
\varphi_j^{(n)} - \sum_{k=0}^{n} a_k K(x_j,x_k)\varphi_k^{(n)} = f(x_j), \quad j = 0, \ldots, n. \tag{12.14}
\]

Conversely, let $\varphi_j^{(n)}$, $j = 0, \ldots, n$, be a solution of the
system (12.14). Then the function $\varphi_n$ defined by

\[
\varphi_n(x) := f(x) + \sum_{k=0}^{n} a_k K(x,x_k)\varphi_k^{(n)}, \quad x \in [a,b], \tag{12.15}
\]

solves equation (12.13).


Proof. The first statement is trivial. For a solution $\varphi_j^{(n)}$,
$j = 0, \ldots, n$, of the system (12.14) the function $\varphi_n$ defined by
(12.15) has values

\[
\varphi_n(x_j) = f(x_j) + \sum_{k=0}^{n} a_k K(x_j,x_k)\varphi_k^{(n)} = \varphi_j^{(n)}, \quad j = 0, \ldots, n.
\]

Inserting this into (12.15), we see that $\varphi_n$ satisfies (12.13). □
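The two steps of Theorem 12.11, solving the linear system (12.14) and then extending the nodal values to all of $[a,b]$ by the interpolation formula (12.15), can be sketched as follows. The helper name `nystrom`, the kernel $K(x,y) = xy/2$, the right-hand side $f(x) = x$, and the choice of the trapezoidal rule with $n = 8$ are illustrative assumptions, not data from the text; for this particular kernel the exact solution is $\varphi(x) = 1.2\,x$.

```python
import numpy as np

def nystrom(kernel, f, nodes, weights):
    """Solve the system (12.14) and return the nodal values together
    with the Nystrom interpolant (12.15). Hypothetical helper."""
    x = np.asarray(nodes, dtype=float)
    a = np.asarray(weights, dtype=float)
    # Matrix of (12.14): delta_jk - a_k K(x_j, x_k)
    M = np.eye(x.size) - kernel(x[:, None], x[None, :]) * a[None, :]
    phi_nodes = np.linalg.solve(M, f(x))

    def phi_n(t):
        # Interpolation formula (12.15), valid for arbitrary t in [a, b]
        t = np.atleast_1d(np.asarray(t, dtype=float))
        return f(t) + kernel(t[:, None], x[None, :]) @ (a * phi_nodes)

    return phi_nodes, phi_n

# Illustrative data (assumed): K(x, y) = x*y/2 on [0, 1], f(x) = x.
kernel = lambda x, y: 0.5 * x * y
f = lambda x: x

n = 8
nodes = np.linspace(0.0, 1.0, n + 1)
weights = np.full(n + 1, 1.0 / n)
weights[[0, -1]] *= 0.5                  # composite trapezoidal rule

phi_nodes, phi_n = nystrom(kernel, f, nodes, weights)

# The interpolant reproduces the nodal values, as stated in the proof.
node_mismatch = np.max(np.abs(phi_n(nodes) - phi_nodes))
value_at_one = phi_n(1.0)[0]             # exact solution gives 1.2
```

The matrix and right-hand side only require evaluations of `kernel` and `f` at the quadrature points, which is the practical appeal of the method.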


The formula (12.15) may be viewed as a natural interpolation of the values
$\varphi_j^{(n)}$, $j = 0, \ldots, n$, at the quadrature points to obtain the
approximating function $\varphi_n$. It was introduced by Nyström in 1930.
For convenience we note the following analogue of Theorem 12.5.


Theorem 12.12 The norm of the quadrature operators $A_n$ is given by

\[
\|A_n\|_\infty = \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x,x_k)|. \tag{12.16}
\]

Proof. For each $\varphi \in C[a,b]$ with $\|\varphi\|_\infty \le 1$ we have

\[
\|A_n\varphi\|_\infty \le \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x,x_k)|,
\]

and therefore $\|A_n\|_\infty$ is smaller than or equal to the right-hand
side of (12.16). Let $z \in [a,b]$ be such that

\[
\sum_{k=0}^{n} |a_k K(z,x_k)| = \max_{a \le x \le b} \sum_{k=0}^{n} |a_k K(x,x_k)|
\]

and choose $\psi \in C[a,b]$ with $\|\psi\|_\infty = 1$ and

\[
a_k K(z,x_k)\psi(x_k) = |a_k K(z,x_k)|, \quad k = 0, \ldots, n.
\]

Then

\[
\|A_n\|_\infty \ge \|A_n\psi\|_\infty \ge |(A_n\psi)(z)| = \sum_{k=0}^{n} |a_k K(z,x_k)|,
\]

and (12.16) is proven. □


The error analysis will be based on the following theorem.

Theorem 12.13 Assume the quadrature formulae $(Q_n)$ to be convergent.
Then the sequence $(A_n)$ is collectively compact and pointwise convergent
(i.e., $A_n\varphi \to A\varphi$, $n \to \infty$, for all
$\varphi \in C[a,b]$) but not norm convergent.


Proof. Since the quadrature formulae $(Q_n)$ are assumed to be convergent,
by (9.13) and the uniform boundedness principle Theorem 12.7 there exists
a constant $C$ such that the weights satisfy

\[
\sum_{k=0}^{n} |a_k^{(n)}| \le C
\]

for all $n \in \mathbb{N}$ (see Theorem 9.10). Then we can estimate

\[
\|A_n\varphi\|_\infty \le C \max_{a \le x,y \le b} |K(x,y)|\,\|\varphi\|_\infty \tag{12.17}
\]

and

\[
|(A_n\varphi)(x_1) - (A_n\varphi)(x_2)| \le C \max_{a \le y \le b} |K(x_1,y) - K(x_2,y)|\,\|\varphi\|_\infty \tag{12.18}
\]

for all $x_1, x_2 \in [a,b]$. From (12.17) and (12.18) we see that

\[
\{A_n\varphi : \varphi \in C[a,b],\ \|\varphi\|_\infty \le 1,\ n \in \mathbb{N}\}
\]

is bounded and equicontinuous because the kernel $K$ is uniformly
continuous on $[a,b] \times [a,b]$. Therefore, by the Arzelà–Ascoli Theorem
12.3 the sequence $(A_n)$ is collectively compact.

Since the quadrature is convergent, for fixed $\varphi \in C[a,b]$ the
sequence $(A_n\varphi)$ is pointwise convergent; i.e.,
$(A_n\varphi)(x) \to (A\varphi)(x)$, $n \to \infty$, for all $x \in [a,b]$.
As a consequence of (12.18), the sequence $(A_n\varphi)$ is equicontinuous.
Hence it is uniformly convergent: $\|A_n\varphi - A\varphi\|_\infty \to 0$,
$n \to \infty$. That is, we have pointwise convergence:
$A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in C[a,b]$
(see Problem 12.7).

For $\varepsilon > 0$ choose a function $\psi_\varepsilon \in C[a,b]$ with
$0 \le \psi_\varepsilon(x) \le 1$ for all $x \in [a,b]$ such that
$\psi_\varepsilon(x) = 1$ if $\min_{j=0,\ldots,n} |x - x_j| \ge \varepsilon$
and $\psi_\varepsilon(x_j) = 0$, $j = 0, \ldots, n$. Then

\[
\|A(\varphi\psi_\varepsilon) - A\varphi\|_\infty
\le \max_{a \le x,y \le b} |K(x,y)| \int_a^b \{1 - \psi_\varepsilon(y)\}\,dy \to 0, \quad \varepsilon \to 0,
\]

for all $\varphi \in C[a,b]$ with $\|\varphi\|_\infty = 1$. Using this
result, we derive

\[
\|A - A_n\|_\infty = \sup_{\|\varphi\|_\infty = 1} \|(A - A_n)\varphi\|_\infty
\ge \sup_{\|\varphi\|_\infty = 1}\,\sup_{\varepsilon > 0} \|(A - A_n)(\varphi\psi_\varepsilon)\|_\infty
\]

\[
= \sup_{\|\varphi\|_\infty = 1}\,\sup_{\varepsilon > 0} \|A(\varphi\psi_\varepsilon)\|_\infty
\ge \sup_{\|\varphi\|_\infty = 1} \|A\varphi\|_\infty = \|A\|_\infty,
\]

whence we see that the sequence $(A_n)$ cannot be norm convergent. □

Theorem 12.13 enables us to apply the approximation theory of Theorem
12.10. For the discussion of the error based on the estimate (12.11) we need
the norm $\|A\varphi - A_n\varphi\|_\infty$. It can be expressed in terms of
the error for the corresponding numerical quadrature by

\[
\|A\varphi - A_n\varphi\|_\infty = \max_{a \le x \le b}
\left| \int_a^b K(x,y)\varphi(y)\,dy - \sum_{k=0}^{n} a_k K(x,x_k)\varphi(x_k) \right|
\]

and requires a uniform estimate for the error of the quadrature applied
to the integration of $K(x,\cdot)\varphi$. Therefore, from the error
estimate (12.11), it follows that under suitable regularity assumptions on
the kernel $K$ and the exact solution $\varphi$, the convergence order of the
underlying quadrature formulae carries over to the convergence order of the
approximate solutions to the integral equation. We illustrate this by the
case of the trapezoidal rule. Under the assumption $\varphi \in C^2[a,b]$ and
$K \in C^2([a,b] \times [a,b])$, by Theorem 9.7, we can estimate

\[
\|A\varphi - A_n\varphi\|_\infty \le \frac{1}{12}\,h^2 (b-a)
\max_{a \le x,y \le b} \left| \frac{\partial^2}{\partial y^2} [K(x,y)\varphi(y)] \right|.
\]


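Example 12.14 Consider the integral equation

\[
\varphi(x) - \frac{1}{2} \int_0^1 (x+1)\,e^{-xy}\varphi(y)\,dy
= e^{-x} - \frac{1}{2} + \frac{1}{2}\,e^{-(x+1)}, \quad 0 \le x \le 1, \tag{12.19}
\]

with exact solution $\varphi(x) = e^{-x}$. For its kernel we have

\[
\max_{0 \le x \le 1} \int_0^1 \frac{1}{2}(x+1)\,e^{-xy}\,dy
= \sup_{0 < x \le 1} \frac{x+1}{2x}\,(1 - e^{-x}) < 1.
\]

Therefore, by the Neumann series Theorem 3.48 and the operator norm
(12.4), equation (12.19) is uniquely solvable.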
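The bound on the operator norm can be confirmed numerically by evaluating (12.4) for the kernel of (12.19) in two ways: from the closed form of the inner integral and from a fine trapezoidal quadrature. This is only a sketch; the grid sizes and the function names `kernel` and `row_integral_closed_form` are arbitrary choices made for illustration.

```python
import numpy as np

def kernel(x, y):
    # Kernel of (12.19): K(x, y) = (x + 1) exp(-x y) / 2
    return 0.5 * (x + 1.0) * np.exp(-x * y)

def row_integral_closed_form(x):
    # (x + 1) / (2x) * (1 - exp(-x)), with the limit 1/2 at x = 0
    x = np.asarray(x, dtype=float)
    out = np.full_like(x, 0.5)
    nz = x > 0
    out[nz] = (x[nz] + 1.0) / (2.0 * x[nz]) * (1.0 - np.exp(-x[nz]))
    return out

x = np.linspace(0.0, 1.0, 201)       # sample points for the maximum over x
y = np.linspace(0.0, 1.0, 2001)      # fine grid for the integral over y
h = y[1] - y[0]

vals = np.abs(kernel(x[:, None], y[None, :]))
# Composite trapezoidal rule in y for every sampled x
row_integral_quad = h * (vals.sum(axis=1) - 0.5 * (vals[:, 0] + vals[:, -1]))

norm_closed = row_integral_closed_form(x).max()
norm_quad = row_integral_quad.max()
```

Both values agree and equal $1 - e^{-1} \approx 0.632$, attained at $x = 1$, so the norm is indeed below one.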

We use the (composite) trapezoidal rule for approximately solving the
integral equation (12.19) by the Nyström method. Table 12.1 gives the
difference between the exact and approximate solutions and clearly shows
the expected convergence rate $O(h^2)$.


TABLE 12.1. Numerical solution of (12.19) by the trapezoidal rule

0.007146 | 0.008878 | 0.010816 | 0.013007 | 0.015479
0.001788 | 0.002224 | 0.002711 | 0.003261 | 0.003882
0.000447 | 0.000556 | 0.000678 | 0.000816 | 0.000971
0.000112 | 0.000139 | 0.000170 | 0.000204 | 0.000243

We now use the (composite) Simpson's rule for the integral equation
(12.19). The numerical results in Table 12.2 show the convergence order
$O(h^4)$, which we expect from the error estimate (12.11) and the convergence
order for Simpson's rule from Theorem 9.8. □


TABLE 12.2. Numerical solution of (12.19) by Simpson's rule

 4 | 0.00006652 | 0.00008311 | 0.00010905 | 0.00015046 | 0.00021416
 8 | 0.00000422 | 0.00000527 | 0.00000692 | 0.00000956 | 0.00001366
16 | 0.00000026 | 0.00000033 | 0.00000043 | 0.00000060 | 0.00000086

After comparing Tables 12.1 and 12.2, we wish to emphasize the major 
advantage of Nyström's method over other methods like the collocation
method, which we will discuss in the next section. The matrix and the 
right-hand side of the linear system (12.14) are obtained by just evaluating 
the kernel K and the given function f at the quadrature points. Therefore, 
without any further computational effort we can improve considerably on 
the approximations by choosing a more accurate numerical quadrature for- 
mula. 

In the next example we consider an integral equation with a periodic 
kernel and a periodic solution. 


Example 12.15 Consider the integral equation

\[
\varphi(t) + \frac{ab}{\pi} \int_0^{2\pi}
\frac{\varphi(\tau)\,d\tau}{a^2 + b^2 - (a^2 - b^2)\cos(t + \tau)} = f(t),
\quad 0 \le t \le 2\pi, \tag{12.20}
\]

where $a \ge b > 0$. This integral equation arises from the solution of the
Dirichlet problem for the Laplace equation in an ellipse with semiaxes $a$
and $b$ (see [39]). Any solution $\varphi$ to the homogeneous form of
equation (12.20) clearly must be a $2\pi$-periodic analytic function, since
the kernel is a $2\pi$-periodic analytic function with respect to the
variable $t$. Hence, we can expand $\varphi$ into a uniformly convergent
Fourier series

\[
\varphi(t) = \sum_{n=0}^{\infty} \alpha_n \cos nt + \sum_{n=1}^{\infty} \beta_n \sin nt.
\]

Inserting this into the homogeneous integral equation and using the
integrals (see Problem 12.10)

\[
\frac{ab}{\pi} \int_0^{2\pi}
\frac{e^{in\tau}\,d\tau}{(a^2 + b^2) - (a^2 - b^2)\cos(t + \tau)}
= \left( \frac{a-b}{a+b} \right)^{n} e^{-int} \tag{12.21}
\]

for $n = 0, 1, 2, \ldots$, it follows that

\[
\alpha_n \left[ 1 + \left( \frac{a-b}{a+b} \right)^{n} \right]
= \beta_n \left[ 1 - \left( \frac{a-b}{a+b} \right)^{n} \right] = 0
\]

for $n = 0, 1, 2, \ldots$. Hence, $\alpha_n = \beta_n = 0$ for
$n = 0, 1, 2, \ldots$, and therefore $\varphi = 0$. Now the Riesz Theorem
12.2 implies that the integral equation (12.20) is uniquely solvable for
each right-hand side $f$.

We numerically want to solve (12.20) in the case where the unique solution
is given by

\[
\varphi(t) = e^{\cos t} \cos(\sin t), \quad 0 \le t \le 2\pi.
\]

Using the integrals (12.21), it can be seen that the right-hand side becomes

\[
f(t) = \varphi(t) + e^{c \cos t} \cos(c \sin t), \quad 0 \le t \le 2\pi,
\]

where $c = (a - b)/(a + b)$.

Since we are dealing with periodic analytic functions, we use the rectan- 
gular rule. From Theorem 9.28 we expect an exponentially decreasing error 
behavior, which is exhibited by the numerical results in Table 12.3 giving 
the difference between the exact and approximate solutions. Doubling the 
number of quadrature points doubles the number of correct digits in the 
approximate solution. 


TABLE 12.3. Nyström method for equation (12.20)

-0.15350443 |  0.01354412 | -0.00636277
-0.00281745 |  0.00009601 | -0.00004247
-0.00000044 |  0.00000001 | -0.00000001

-0.69224130 | -0.06117951 | -0.06216587
-0.15017166 | -0.00971695 | -0.01174302
-0.00602633 | -0.00036043 | -0.00045498
-0.00000919 | -0.00000055 | -0.00000069


The actual size of the error, i.e., the constant factor in the exponential
decay, depends on the parameters $a$ and $b$, which describe the location of
the singularities of the integrands in the complex plane; i.e., they
determine the width of the strip of the complex plane into which the kernel
can be extended as a holomorphic function.

Note that for periodic analytic functions the rectangular rule generally
yields better approximations than Simpson's rule (see Problem 9.12). □
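A minimal sketch of the Nyström method with the rectangular rule for (12.20) is given below. The parameter values `a = 1.0` and `b = 0.5` are assumptions chosen for illustration, and the error is measured as the maximum over all $2n$ quadrature points $t_j = j\pi/n$, which need not match the points reported in Table 12.3.

```python
import numpy as np

def nystrom_ellipse_error(n, a=1.0, b=0.5):
    """Maximum nodal error of the rectangular-rule Nystrom solution of
    (12.20) with 2n equidistant points. The values of a and b are assumed."""
    c = (a - b) / (a + b)
    t = np.pi * np.arange(2 * n) / n                  # t_j = j*pi/n
    exact = np.exp(np.cos(t)) * np.cos(np.sin(t))
    f = exact + np.exp(c * np.cos(t)) * np.cos(c * np.sin(t))
    denom = a**2 + b**2 - (a**2 - b**2) * np.cos(t[:, None] + t[None, :])
    # Factor (ab/pi) of the kernel times the quadrature weight pi/n gives ab/n.
    M = np.eye(2 * n) + (a * b / n) / denom
    phi = np.linalg.solve(M, f)
    return np.max(np.abs(phi - exact))

# Errors indexed by the number 2n of quadrature points
errors = {2 * n: nystrom_ellipse_error(n) for n in (2, 4, 8, 16)}
```

With these assumed parameters the error drops rapidly as the number of points doubles, in line with the exponential decay described in the example.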
We confine ourselves to these few examples for the application of the
Nyström method. For a greater variety the reader is referred to [1, 3, 6, 19,
25, 30, 39, 49].

With the aid of appropriately chosen quadrature formulae, which take
care of the singularity by a weighted product rule, the Nyström method
can also be successfully applied to weakly singular integral equations of the
second kind (see [39]).


12.4 The Collocation Method 


The collocation method for approximately solving an equation of the second
kind

\[
\varphi - A\varphi = f \tag{12.22}
\]

consists in seeking an approximate solution from a finite-dimensional
subspace by requiring that the equation (12.22) be satisfied at only a
finite number of so-called collocation points. Assume that
$A : C[a,b] \to C[a,b]$ is a bounded linear operator and let
$X_n = \operatorname{span}\{u_0^{(n)}, \ldots, u_n^{(n)}\} \subset C[a,b]$
denote a sequence of subspaces with $\dim X_n = n + 1$. Choose $n + 1$ points
$a \le x_0^{(n)} < \cdots < x_n^{(n)} \le b$ such that the interpolation at
these grid points with respect to the subspace $X_n$ is uniquely solvable.
Typical examples for the choice of $X_n$ are polynomials, trigonometric
polynomials, and splines (see also Problem 8.1). For convenience we will
again write $x_0, \ldots, x_n$ instead of $x_0^{(n)}, \ldots, x_n^{(n)}$, and
$u_0, \ldots, u_n$ instead of $u_0^{(n)}, \ldots, u_n^{(n)}$.
By $L_n : C[a,b] \to X_n$ we denote the operator that maps the function
$f \in C[a,b]$ into its uniquely determined interpolating function
$L_nf \in X_n$ with the property

\[
(L_nf)(x_j) = f(x_j), \quad j = 0, \ldots, n.
\]

Representing $L_n$ in terms of the Lagrange basis, i.e., in terms of the
uniquely determined functions $\ell_0, \ldots, \ell_n \in X_n$ with the
interpolation property

\[
\ell_k(x_j) = \delta_{jk}, \quad j, k = 0, \ldots, n,
\]

in the form

\[
L_nf = \sum_{k=0}^{n} f(x_k)\,\ell_k, \tag{12.23}
\]

it can be seen that the operator $L_n : C[a,b] \to X_n$ is linear and
bounded (with respect to the maximum norm). Moreover, since $L_nf = f$ for
all $f \in X_n$, the interpolation operator is a projection operator; i.e.,
$L_n^2 = L_n$ (see p. 157 and Problem 8.4).

The collocation method approximates the solution of (12.22) by an element
$\varphi_n \in X_n$ satisfying

\[
\varphi_n(x_j) - (A\varphi_n)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.24}
\]

We express $\varphi_n$ as a linear combination

\[
\varphi_n = \sum_{k=0}^{n} \gamma_k u_k
\]

and immediately see that equation (12.24) is equivalent to the linear system

\[
\sum_{k=0}^{n} \gamma_k \{ u_k(x_j) - (Au_k)(x_j) \} = f(x_j), \quad j = 0, \ldots, n, \tag{12.25}
\]

for the coefficients $\gamma_0, \ldots, \gamma_n$. If we use the Lagrange
basis for $X_n$ and write

\[
\varphi_n = \sum_{k=0}^{n} \gamma_k \ell_k,
\]

then of course $\gamma_j = \varphi_n(x_j)$, $j = 0, \ldots, n$, and the
system (12.25) becomes

\[
\gamma_j - \sum_{k=0}^{n} \gamma_k (A\ell_k)(x_j) = f(x_j), \quad j = 0, \ldots, n. \tag{12.26}
\]

From the systems (12.25) and (12.26) it is obvious that the collocation
method is only semidiscrete, since in general, additional approximations
are needed in order to compute the matrix entries $(Au_k)(x_j)$ or
$(A\ell_k)(x_j)$.

The collocation method can be interpreted as a projection method; i.e.,
since the interpolating function is uniquely determined by its values at the
interpolation points, equation (12.24) is equivalent to

\[
\varphi_n - L_nA\varphi_n = L_nf. \tag{12.27}
\]

This equation can be considered as an equation in the whole space $C[a,b]$
because any solution $\varphi_n = L_nA\varphi_n + L_nf$ automatically
belongs to $X_n$. Hence, our general error and convergence results for
operator equations of the second kind can be applied to the collocation
method.


Theorem 12.16 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ satisfy $\|L_nA - A\|_\infty \to 0$, $n \to \infty$.
Then, for sufficiently large $n$, the approximate equation (12.27) is
uniquely solvable for all $f \in C[a,b]$, and we have the error estimate

\[
\|\varphi_n - \varphi\|_\infty \le C \|L_n\varphi - \varphi\|_\infty \tag{12.28}
\]

for some positive constant $C$ depending on $A$.

Proof. From Theorem 12.6 applied to $A_n = L_nA$, we conclude that for
all sufficiently large $n$ the inverse operators $(I - L_nA)^{-1}$ exist and
are uniformly bounded. To verify the error bound, we apply the interpolation
operator $L_n$ to (12.22) and get

\[
\varphi - L_nA\varphi = L_nf + \varphi - L_n\varphi.
\]

Subtracting this from (12.27) we find

\[
(I - L_nA)(\varphi_n - \varphi) = L_n\varphi - \varphi,
\]

whence the estimate (12.28) follows. □


Corollary 12.17 Let $A : C[a,b] \to C[a,b]$ be a compact linear operator
such that $I - A$ is injective, and assume that the interpolation operators
$L_n : C[a,b] \to X_n$ are pointwise convergent; i.e.,
$L_n\varphi \to \varphi$, $n \to \infty$, for all $\varphi \in C[a,b]$. Then,
for sufficiently large $n$, the approximate equation (12.27) is uniquely
solvable for all $f \in C[a,b]$, and the estimate (12.28) holds.


Proof. By Lemma 12.9 the pointwise convergence of the interpolation
operators $L_n$ and the compactness of $A$ imply that
$\|L_nA - A\|_\infty \to 0$, $n \to \infty$. Now the statement follows from
the preceding theorem. □

We note that the collocation method may of course also be applied in
function spaces other than the space $C[a,b]$.


We proceed by considering the collocation method for integral equations
of the second kind

\[
\varphi(x) - \int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b], \tag{12.29}
\]

with continuous kernel $K$. Using the interpolation operator, in this case we
can rewrite the collocation equation (12.26) in the form

\[
\varphi_n(x) - \int_a^b [L_nK(\cdot,y)](x)\,\varphi_n(y)\,dy = (L_nf)(x), \quad x \in [a,b], \tag{12.30}
\]

and the systems (12.25) and (12.26) become

\[
\sum_{k=0}^{n} \gamma_k \left\{ u_k(x_j) - \int_a^b K(x_j,y)u_k(y)\,dy \right\} = f(x_j), \quad j = 0, \ldots, n, \tag{12.31}
\]

and

\[
\gamma_j - \sum_{k=0}^{n} \gamma_k \int_a^b K(x_j,y)\ell_k(y)\,dy = f(x_j), \quad j = 0, \ldots, n, \tag{12.32}
\]

respectively. There exists a broad variety of collocation methods
corresponding to various choices for the subspaces $X_n$, for the basis
functions $u_0, \ldots, u_n$, and for the collocation points
$x_0, \ldots, x_n$. We briefly discuss two possibilities, based on linear
splines and on trigonometric polynomials.

First we consider piecewise linear interpolation. Let $x_j = a + jh$,
$j = 0, \ldots, n$, denote an equidistant subdivision with step size
$h = (b-a)/n$ and let $X_n$ be the space of continuous functions on $[a,b]$
whose restrictions on each of the subintervals $[x_{j-1}, x_j]$,
$j = 1, \ldots, n$, coincide with a linear function. As in Section 11.5, the
Lagrange basis is given by

\[
\ell_k(x) =
\begin{cases}
\dfrac{1}{h}\,(x - x_{k-1}), & x \in [x_{k-1}, x_k], \\[1ex]
\dfrac{1}{h}\,(x_{k+1} - x), & x \in [x_k, x_{k+1}], \\[1ex]
0, & x \notin [x_{k-1}, x_{k+1}],
\end{cases}
\]

for $k = 0, \ldots, n$. Since for piecewise linear interpolation we have that

\[
\|L_nf\|_\infty \le \max_{j=0,\ldots,n} |f(x_j)| \le \|f\|_\infty
\]

with equality holding if $f$ is constant, we observe that
$\|L_n\|_\infty = 1$ for the corresponding interpolation operator $L_n$.
Here, we have pointwise convergence $L_n\varphi \to \varphi$,
$n \to \infty$. This can be seen from the error estimate (8.9) and the
Weierstrass approximation theorem, analogous to the proof of the Szegő
Theorem 9.10. Therefore, in this case Corollary 12.17 applies, and we can
state the following result.


Theorem 12.18 The collocation method with linear splines converges for 
integral equations of the second kind with continuous kernels. 


Provided that the exact solution of the integral equation is twice
continuously differentiable, then from the error estimate (8.9) for linear
interpolation and Corollary 12.17 we derive an error estimate of the form

\[
\|\varphi_n - \varphi\|_\infty \le C \|\varphi''\|_\infty h^2
\]

for the linear spline collocation approximate solution $\varphi_n$. Here,
$C$ denotes some constant depending on the kernel $K$.

In general, in most practical problems the evaluation of the matrix entries
in (12.32) will require a numerical quadrature for integrals of the form
$\int_a^b K(x_j,y)\ell_k(y)\,dy$. To be consistent with our approximations,
we replace $K(x_j,\cdot)$ by its piecewise linear interpolation; i.e., we
approximate

\[
\int_a^b K(x_j,y)\ell_k(y)\,dy \approx \sum_{i=0}^{n} K(x_j,x_i) \int_a^b \ell_i(y)\ell_k(y)\,dy
\]

for $j, k = 0, \ldots, n$. Straightforward calculations yield the
tridiagonal matrix

\[
W = \frac{h}{6}
\begin{pmatrix}
2 & 1 & & & & \\
1 & 4 & 1 & & & \\
 & 1 & 4 & 1 & & \\
 & & \ddots & \ddots & \ddots & \\
 & & & 1 & 4 & 1 \\
 & & & & 1 & 2
\end{pmatrix}
\]

for the weights $w_{ik} = \int_a^b \ell_i(y)\ell_k(y)\,dy$.
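The entries of the weight matrix can be checked independently. The sketch below builds the tridiagonal matrix on $[0,1]$ and compares it with the integrals of the products of hat functions, computed by a two-point Gauss rule on each subinterval (exact here, since the products are piecewise quadratic). The interval $[0,1]$, the value `n = 5`, and the function names are illustrative assumptions.

```python
import numpy as np

def hat_weight_matrix(n, a=0.0, b=1.0):
    """Tridiagonal matrix (h/6) * tridiag(1, 4, 1) with corner entries 2."""
    h = (b - a) / n
    W = np.diag(np.full(n + 1, 4.0)) + np.diag(np.ones(n), 1) + np.diag(np.ones(n), -1)
    W[0, 0] = W[n, n] = 2.0
    return h / 6.0 * W

def hat(k, y, n, a=0.0, b=1.0):
    """Lagrange basis function l_k of piecewise linear interpolation."""
    h = (b - a) / n
    return np.clip(1.0 - np.abs(y - (a + k * h)) / h, 0.0, None)

n = 5
h = 1.0 / n
W = hat_weight_matrix(n)

# Two-point Gauss-Legendre nodes on every subinterval; each node has weight h/2.
mid = (np.arange(n) + 0.5) * h
offset = h / (2.0 * np.sqrt(3.0))
gauss_pts = np.concatenate([mid - offset, mid + offset])

B = np.array([hat(k, gauss_pts, n) for k in range(n + 1)])   # shape (n+1, 2n)
W_quad = (h / 2.0) * B @ B.T

max_diff = np.max(np.abs(W - W_quad))
total = W.sum()        # hat functions sum to one, so this equals b - a
```

Because the hat functions form a partition of unity, the sum of all entries of `W` equals the length of the interval.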

We now investigate the influence of these approximations on the error
analysis. We interpret the solution of the system (12.32) with the
approximate values for the coefficients as the solution $\tilde\varphi_n$ of
an additional approximate equation

\[
\tilde\varphi_n - A_n\tilde\varphi_n = L_nf, \tag{12.33}
\]

namely of the collocation equation

\[
\tilde\varphi_n(x) - \int_a^b [L_nK_n(\cdot,y)](x)\,\tilde\varphi_n(y)\,dy = (L_nf)(x), \quad a \le x \le b,
\]

with

\[
K_n(x,y) := \sum_{i=0}^{n} K(x,x_i)\,\ell_i(y);
\]

i.e., $K_n(x,y) = [L_nK(x,\cdot)](y)$ interpolates $K$ with respect to the
second variable. We assume that the kernel $K$ is twice continuously
differentiable on $[a,b] \times [a,b]$. Then, using the error estimate (8.9),
we have

\[
|K(x,y) - K_n(x,y)| \le \frac{h^2}{8} \left\| \frac{\partial^2 K}{\partial y^2} \right\|_\infty
\]

for all $a \le x, y \le b$. Writing

\[
K(x,y) - [L_nK_n(\cdot,y)](x) = K(x,y) - [L_nK(\cdot,y)](x) + [L_n\{K(\cdot,y) - K_n(\cdot,y)\}](x)
\]

and using the fact that for the piecewise linear spline interpolation we have
$\|L_n\|_\infty = 1$, from (8.9) we obtain

\[
|K(x,y) - [L_nK_n(\cdot,y)](x)| \le \frac{h^2}{8}
\left\{ \left\| \frac{\partial^2 K}{\partial x^2} \right\|_\infty
+ \left\| \frac{\partial^2 K}{\partial y^2} \right\|_\infty \right\}
\]

for all $a \le x, y \le b$. Hence, in view of (12.4), for the integral
operator $A_n$ in (12.33) we have $\|A_n - A\|_\infty = O(h^2)$. When $f$ is
twice continuously differentiable, we also have
$\|L_nf - f\|_\infty = O(h^2)$. Therefore, from Theorem 12.6 we can now
conclude that the approximate equation (12.33) is uniquely solvable for
sufficiently large $n$ and that for the approximate solution we have an
error estimate $\|\tilde\varphi_n - \varphi\|_\infty = O(h^2)$. Therefore,
the fully discrete approximation still is of order $O(h^2)$.
Example 12.19 Consider the integral equation (12.19) of Example 12.14.
Table 12.4 gives the error between the exact solution and the fully discrete
collocation approximation with linear splines. It clearly exhibits the error
behavior $O(h^2)$. □


TABLE 12.4. Numerical results for spline collocation

0.004808 | 0.005430 | 0.006178 | 0.007128 | 0.008331
0.001199 | 0.001354 | 0.001541 | 0.001778 | 0.002078
0.000300 | 0.000338 | 0.000385 | 0.000444 | 0.000519
0.000075 | 0.000085 | 0.000096 | 0.000111 | 0.000130
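The fully discrete linear spline collocation for (12.19) amounts to solving $(I - KW)\gamma = f$, where $K$ holds the kernel values $K(x_j, x_i)$ and $W$ is the tridiagonal weight matrix. The following sketch assumes that the error is measured as the maximum over all grid points, which may differ from the points reported in Table 12.4; the function names and the list of values of $n$ are illustrative.

```python
import numpy as np

def kernel(x, y):
    # Kernel of (12.19)
    return 0.5 * (x + 1.0) * np.exp(-x * y)

def rhs(x):
    # Right-hand side of (12.19)
    return np.exp(-x) - 0.5 + 0.5 * np.exp(-(x + 1.0))

def spline_collocation_error(n):
    """Maximum nodal error of fully discrete linear spline collocation."""
    h = 1.0 / n
    x = np.linspace(0.0, 1.0, n + 1)
    # Weights w_ik = integral of l_i * l_k over [0, 1]
    W = np.diag(np.full(n + 1, 4.0)) + np.diag(np.ones(n), 1) + np.diag(np.ones(n), -1)
    W[0, 0] = W[n, n] = 2.0
    W *= h / 6.0
    K = kernel(x[:, None], x[None, :])            # K[j, i] = K(x_j, x_i)
    # System (12.32) with each integral replaced by sum_i K(x_j, x_i) w_ik
    gamma = np.linalg.solve(np.eye(n + 1) - K @ W, rhs(x))
    return np.max(np.abs(gamma - np.exp(-x)))

errors = [spline_collocation_error(n) for n in (4, 8, 16, 32)]
ratios = [errors[i] / errors[i + 1] for i in range(len(errors) - 1)]
```

Ratios close to 4 when $h$ is halved are what the $O(h^2)$ estimate predicts.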


We note that in principle, a collocation method with error $O(h^4)$ can
be obtained from cubic spline interpolation (see Theorem 8.34). However,
the numerical implementation is much more involved. This again illustrates
that the Nyström method is more practical, since there it is quite easy to
change the order from $O(h^2)$ to $O(h^4)$ by simply replacing the weights of
the trapezoidal rule by those of Simpson's rule.

We proceed by discussing the collocation method based on trigonometric
interpolation with equidistant knots $t_j = j\pi/n$, $j = 0, \ldots, 2n-1$.
First, we establish a convergence result for the trigonometric interpolation
of differentiable functions (see Problem 8.12).


Lemma 12.20 Let $f \in C^1[0,2\pi]$. Then for the remainder in
trigonometric interpolation we have

\[
\|L_nf - f\|_\infty \le c_n \|f'\|_2, \tag{12.34}
\]

where $c_n \to 0$, $n \to \infty$.


Proof. Consider the trigonometric monomials $f_m(t) = e^{imt}$ and write
$m = (2k+1)n + q$ with $k \in \mathbb{Z}$ and $0 \le q < 2n$. Since
$f_m(t_j) = f_{q-n}(t_j)$ for $j = 0, \ldots, 2n-1$, the trigonometric
interpolation polynomials for $f_m$ and $f_{q-n}$ coincide. Therefore, we
have

\[
\|L_nf_m - f_m\|_\infty \le 2
\]

for all $|m| \ge n$. Since $f$ is continuously differentiable, we can expand
it into a uniformly convergent Fourier series (see Problem 12.14)

\[
f = \sum_{m=-\infty}^{\infty} a_m f_m.
\]

From the relation

\[
\int_0^{2\pi} f'(t)\,e^{-imt}\,dt = im \int_0^{2\pi} f(t)\,e^{-imt}\,dt = 2\pi i m\,a_m
\]

for the Fourier coefficients it follows that

\[
\|f'\|_2^2 = \int_0^{2\pi} |f'(t)|^2\,dt = 2\pi \sum_{m=-\infty}^{\infty} m^2 |a_m|^2.
\]

Using this identity and the Cauchy–Schwarz inequality, we derive

\[
\|L_nf - f\|_\infty^2 \le 4 \Bigl\{ \sum_{|m| \ge n} |a_m| \Bigr\}^2
\le \frac{4}{\pi}\,\|f'\|_2^2 \sum_{m=n}^{\infty} \frac{1}{m^2}.
\]

This implies (12.34). □

Now, consider an integral equation of the second kind with $2\pi$-periodic
continuously differentiable kernel $K$ and right-hand side $f$. The
corresponding integral operator $A$ maps $C[0,2\pi]$ into $C^1[0,2\pi]$ and
satisfies $\|(A\varphi)'\|_2 \le M \|\varphi\|_\infty$, where
$M = 2\pi\sqrt{2\pi}\,\|\partial K/\partial t\|_\infty$. Therefore, making
use of (12.34), we find

\[
\|L_nA\varphi - A\varphi\|_\infty \le c_n \|(A\varphi)'\|_2 \le c_n M \|\varphi\|_\infty
\]

for all $\varphi \in C[0,2\pi]$. Hence,
$\|L_nA - A\|_\infty \le c_n M \to 0$, $n \to \infty$, and Theorem 12.16 can
be applied to obtain the following result.


Theorem 12.21 The collocation method with trigonometric polynomials 
converges for integral equations of the second kind with continuously differ- 
entiable periodic kernels and right-hand sides. 


One possibility for the implementation of the collocation method is to
use the trigonometric monomials as basis functions. Then the integrals
$\int_0^{2\pi} K(t_j,\tau)\,e^{ik\tau}\,d\tau$ have to be evaluated
numerically. Replacing the kernel by its trigonometric interpolation leads
to the quadrature formula

\[
\int_0^{2\pi} K(t_j,\tau)\,e^{ik\tau}\,d\tau \approx \frac{\pi}{n} \sum_{m=0}^{2n-1} K(t_j,t_m)\,e^{ikt_m}
\]

for $j = 0, \ldots, 2n-1$. Using fast Fourier transform techniques (see
Section 8.2) these quadratures can be carried out very rapidly. A second,
even more efficient, possibility is to use the Lagrange basis

\[
\ell_k(t) = \frac{1}{2n} \left\{ 1 + 2 \sum_{m=1}^{n-1} \cos m(t - t_k) + \cos n(t - t_k) \right\} \tag{12.35}
\]

for $k = 0, \ldots, 2n-1$, which can be derived from Theorem 8.25 (see
Problem 12.13).
For the evaluation of the matrix coefficients
$\int_0^{2\pi} K(t_j,\tau)\ell_k(\tau)\,d\tau$ we proceed analogously to the
preceding case of linear splines. We approximate these integrals by
replacing $K(t_j,\cdot)$ by its trigonometric interpolation polynomial,
i.e., we approximate

\[
\int_0^{2\pi} K(t_j,\tau)\ell_k(\tau)\,d\tau \approx \sum_{m=0}^{2n-1} K(t_j,t_m) \int_0^{2\pi} \ell_m(\tau)\ell_k(\tau)\,d\tau
\]

for $j, k = 0, \ldots, 2n-1$. Using (12.35), elementary integrations yield
(see Problem 12.13)

\[
\int_0^{2\pi} \ell_m(\tau)\ell_k(\tau)\,d\tau = \frac{\pi}{n}\,\delta_{mk} - (-1)^{m-k}\,\frac{\pi}{4n^2} \tag{12.36}
\]

for $m, k = 0, \ldots, 2n-1$. Note that despite the global nature of the
trigonometric interpolation and its Lagrange basis, due to the simple
structure of the weights (12.36) in the quadrature rule, the computation of
the matrix elements is not too costly. The only additional computational
effort besides the kernel evaluation is the computation of the row sums

\[
\sum_{m=0}^{2n-1} (-1)^m K(t_j,t_m)
\]

for $j = 0, \ldots, 2n-1$. We omit the analysis of the additional error in
the fully discrete method caused by the numerical quadrature.
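Both the interpolation property of the Lagrange basis (12.35) and the weights (12.36) lend themselves to a quick numerical check. The sketch below uses `n = 3` as an arbitrary choice and evaluates the integrals with a rectangular rule on $8n$ points, which is exact for the trigonometric polynomials of degree at most $2n$ that occur here; the function name `trig_lagrange` is hypothetical.

```python
import numpy as np

def trig_lagrange(k, t, n):
    """Lagrange basis (12.35) for the knots t_j = j*pi/n, j = 0, ..., 2n-1."""
    t = np.asarray(t, dtype=float)
    d = t - np.pi * k / n
    m = np.arange(1, n)
    s = 1.0 + 2.0 * np.cos(np.multiply.outer(d, m)).sum(axis=-1) + np.cos(n * d)
    return s / (2 * n)

n = 3                                               # illustrative choice
knots = np.pi * np.arange(2 * n) / n

# Interpolation property l_k(t_j) = delta_jk
L_knots = np.array([trig_lagrange(k, knots, n) for k in range(2 * n)])
interp_defect = np.max(np.abs(L_knots - np.eye(2 * n)))

# Rectangular rule with N > 2n points integrates the products l_m * l_k exactly
N = 8 * n
tau = 2.0 * np.pi * np.arange(N) / N
B = np.array([trig_lagrange(k, tau, n) for k in range(2 * n)])
W_quad = (2.0 * np.pi / N) * B @ B.T

# Closed form (12.36)
idx = np.arange(2 * n)
sign = (-1.0) ** (idx[:, None] - idx[None, :])
W_formula = (np.pi / n) * np.eye(2 * n) - sign * np.pi / (4 * n**2)

weight_defect = np.max(np.abs(W_quad - W_formula))
```

The weight matrix is a multiple of the identity plus a rank-one correction, which is why only the alternating row sums of the kernel values are needed in addition to the kernel evaluations.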


Example 12.22 For the integral equation (12.20) from Example 12.15,
Table 12.5 gives the error between the exact solution and the collocation
approximation.


TABLE 12.5. Collocation method for equation (12.20)

-0.10752855 | -0.03243176 |  0.03961310
-0.00231537 |  0.00059809 |  0.00045961
-0.00000044 |  0.00000002 | -0.00000000

-0.56984945 | -0.18357135 |  0.06022598
-0.14414257 | -0.00368787 | -0.00571394
-0.00602543 | -0.00035953 | -0.00045408
-0.00000919 | -0.00000055 | -0.00000069

Again we have exponential convergence, as is to be expected from the
estimate (12.28) and the error analysis for the trigonometric interpolation
for analytic functions [38]. □


In general, the fully discrete implementation of the collocation method
as described by our two examples can be used in all situations where the
required numerical quadratures for the matrix elements can be carried out in
closed form for the chosen approximating subspace and collocation points.
In all these cases, of course, the quadrature formulae that are required for
the related Nyström method will also be available. Because the approximation
order for both methods usually will be the same, Nyström's method
is preferable, since it requires the least computational effort for evaluating
the matrix elements. However, the situation changes in cases where no
straightforward quadrature rules for the application of Nyström's method
are available.

Again, for a greater variety of collocation methods the reader is referred 
to [1, 3, 6, 19, 25, 30, 39, 49]. 


12.5 Stability 


For finite-dimensional approximations of a given operator equation we have 
to distinguish three condition numbers, namely, the condition numbers of 
the original operator and of the approximating operator as mappings in 
the underlying normed spaces, and the condition number of the linear sys- 
tem for the actual numerical solution. This latter system we can influence, 
for example in the collocation method by the choice of the basis for the 
approximating subspaces. 

Consider an equation of the second kind $\varphi - A\varphi = f$ in a Banach space
$X$ and approximating equations $\varphi_n - A_n\varphi_n = f_n$ under the assumptions of
Theorem 12.6, i.e., norm convergence, or of Theorem 12.10, i.e., collective
compactness and pointwise convergence. Then, recalling Definition 5.2 of
the condition number, from Theorems 12.6 and 12.10 it follows that the
condition numbers $\operatorname{cond}(I - A_n)$ are uniformly bounded. Hence, for the
condition of the approximating scheme, we mainly have to be concerned
with the condition of the linear system for the actual computation of the
solution of $\varphi_n - A_n\varphi_n = f_n$.

For the discussion of the condition number for the Nyström method we
recall the linear system (12.14) and denote by $\widetilde{A}_n$ the matrix with the
entries $a_k K(x_j, x_k)$. We introduce operators $R_n : C[a,b] \to \mathbb{R}^{n+1}$ by

\[
R_n : f \mapsto (f(x_0), \ldots, f(x_n))^T, \quad f \in C[a,b],
\]

and $M_n : \mathbb{R}^{n+1} \to C[a,b]$, where $M_n\Phi$ is the piecewise linear interpolation
with $(M_n\Phi)(x_j) = \Phi_j$, $j = 0,\ldots,n$, for $\Phi = (\Phi_0,\ldots,\Phi_n)^T$. (If $a < x_0$, we
set $(M_n\Phi)(x) = \Phi_0$ for $a \le x \le x_0$; and if $x_n < b$, we set $(M_n\Phi)(x) = \Phi_n$
for $x_n \le x \le b$.) Then clearly, $\|R_n\|_\infty = \|M_n\|_\infty = 1$.
From Theorem 12.11 we conclude that

\[
(I - \widetilde{A}_n) = R_n (I - A_n) M_n
\]

and

\[
(I - \widetilde{A}_n)^{-1} = R_n (I - A_n)^{-1} M_n.
\]


From these relations we immediately obtain the following theorem. 


Theorem 12.23 For the Nyström method the condition numbers for the
linear system are uniformly bounded.


This theorem states that the Nyström method essentially preserves the
stability of the original integral equation.
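The uniform bound can be observed numerically. The sketch below assembles the matrix $I - \widetilde{A}_n$, with $\widetilde{A}_n$ the matrix with entries $a_k K(x_j, x_k)$, for the composite trapezoidal rule on $[0,1]$ and computes its condition number in the maximum norm. The sample kernel $K(x,y) = \frac{1}{2}(x+1)e^{-xy}$, the grid sizes, and all function names are assumptions made for this illustration only.

```python
import math

def inverse(a):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def norm_inf(a):
    """Maximum absolute row sum, i.e., the matrix norm induced by the maximum norm."""
    return max(sum(abs(v) for v in row) for row in a)

def sample_kernel(x, y):
    # Assumed sample kernel of a second-kind equation, for illustration only.
    return 0.5 * (x + 1.0) * math.exp(-x * y)

def nystrom_matrix(n):
    """I minus the Nystrom matrix for the trapezoidal rule with n subintervals of [0, 1]."""
    h = 1.0 / n
    x = [j * h for j in range(n + 1)]
    w = [h / 2 if k in (0, n) else h for k in range(n + 1)]
    return [[(1.0 if j == k else 0.0) - w[k] * sample_kernel(x[j], x[k])
             for k in range(n + 1)] for j in range(n + 1)]

conds = {}
for n in (4, 8, 16, 32):
    m = nystrom_matrix(n)
    conds[n] = norm_inf(m) * norm_inf(inverse(m))
    print(n, conds[n])
```

For this kernel the row sums of the quadrature matrix stay below one, so a Neumann series argument already bounds the condition numbers independently of $n$; the printed values should reflect this.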

For the collocation method, we introduce the matrices $E_n$ with entries
$u_k(x_j)$ and $\widetilde{A}_n$ with entries $(Au_k)(x_j)$. Since $X_n = \operatorname{span}\{u_0,\ldots,u_n\}$ is
assumed to be such that the interpolation problem with respect to the
collocation points $x_0,\ldots,x_n$ is uniquely solvable, the matrix $E_n$ is invertible
(see Problem 8.1). In addition, let the operator $W_n : \mathbb{R}^{n+1} \to C[a,b]$ be
defined by

\[
W_n : \gamma \mapsto \sum_{k=0}^{n} \gamma_k u_k
\]

for $\gamma = (\gamma_0,\ldots,\gamma_n)^T$ and recall the operators $R_n$ and $M_n$ from above.
Then we have

\[
W_n = L_n M_n E_n.
\]

From (12.25) we can conclude that

\[
(E_n - \widetilde{A}_n) = R_n L_n (I - A) W_n
\]

and

\[
(E_n - \widetilde{A}_n)^{-1} = E_n^{-1} R_n (I - L_n A)^{-1} L_n M_n.
\]

From these three relations, and the fact that by Theorems 12.7 and 12.16
the sequence of operators $(I - L_n A)^{-1} L_n$ is uniformly bounded, we obtain
the following theorem.


Theorem 12.24 Under the assumptions of Theorem 12.16, for the
collocation method the condition number of the linear system satisfies

\[
\operatorname{cond}(E_n - \widetilde{A}_n) \le C \, \|L_n\|_\infty^2 \, \operatorname{cond} E_n
\]

for all sufficiently large $n$ and some constant $C$.
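The estimate can be traced term by term. As a sketch, assume the maximum norm on $\mathbb{R}^{n+1}$ and on $C[a,b]$, write $\widetilde{A}_n$ for the collocation matrix with entries $(Au_k)(x_j)$, and use $\|R_n\|_\infty = \|M_n\|_\infty = 1$. The relation for $W_n$ and the two relations for the collocation matrix and its inverse then give the following bounds; the constant $C$ named at the end is only one admissible choice.

```latex
\begin{aligned}
\|E_n - \widetilde{A}_n\|_\infty
  &\le \|R_n\|_\infty \, \|L_n\|_\infty \, \|I - A\|_\infty \, \|W_n\|_\infty
   \le \|L_n\|_\infty^2 \, \|I - A\|_\infty \, \|E_n\|_\infty, \\
\|(E_n - \widetilde{A}_n)^{-1}\|_\infty
  &\le \|E_n^{-1}\|_\infty \, \|(I - L_n A)^{-1} L_n\|_\infty, \\
\operatorname{cond}(E_n - \widetilde{A}_n)
  &\le C \, \|L_n\|_\infty^2 \, \operatorname{cond} E_n,
  \qquad C := \|I - A\|_\infty \, \sup_n \|(I - L_n A)^{-1} L_n\|_\infty .
\end{aligned}
```

Here the supremum is taken over all sufficiently large $n$, and the second inequality in the first line uses $W_n = L_n M_n E_n$.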
This theorem suggests that the basis functions must be chosen with caution.
For a poor choice, like monomials, the condition number of $E_n$ can
grow quite rapidly. However, for the Lagrange basis, i.e., for the linear
system (12.26), $E_n$ becomes the identity matrix with condition number one.
In addition, $\|L_n\|_\infty$ enters in the estimate on the condition number of the
linear system, and for example, for polynomial or trigonometric polynomial
interpolation we have $\|L_n\|_\infty \to \infty$, $n \to \infty$ (see Theorem 8.16).
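The effect of the basis on $\operatorname{cond} E_n$ is easy to observe. The following sketch assumes monomials $u_k(x) = x^k$ and equidistant collocation points $x_j = j/n$ on $[0,1]$ (both choices are illustrative assumptions), computes the condition number of $E_n$ in the maximum norm, and compares it with the identity matrix obtained for a Lagrange basis. All function names are hypothetical.

```python
def inverse(a):
    """Invert a small dense matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def norm_inf(a):
    """Maximum absolute row sum."""
    return max(sum(abs(v) for v in row) for row in a)

def cond_inf(a):
    return norm_inf(a) * norm_inf(inverse(a))

def monomial_matrix(n):
    """E_n with entries u_k(x_j) = x_j**k at the assumed points x_j = j/n."""
    x = [j / n for j in range(n + 1)]
    return [[x_j ** k for k in range(n + 1)] for x_j in x]

def lagrange_matrix(n):
    """E_n for a Lagrange basis: u_k(x_j) is the Kronecker delta."""
    return [[1.0 if j == k else 0.0 for k in range(n + 1)] for j in range(n + 1)]

cond_monomial = {n: cond_inf(monomial_matrix(n)) for n in (2, 4, 8)}
cond_lagrange = cond_inf(lagrange_matrix(8))
for n, c in cond_monomial.items():
    print("monomials, n =", n, "cond =", c)
print("Lagrange basis, n = 8, cond =", cond_lagrange)
```

For $n = 2$ the monomial matrix has condition number 24 in this norm, and the growth for larger $n$ is rapid, whereas the Lagrange basis keeps the value one.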

In the context of stability we will conclude this chapter with a few re- 
marks on integral equations of the first kind. 


Theorem 12.25 Let $X$ and $Y$ be normed spaces and let $A : X \to Y$ be a
compact linear operator. Then $A$ can have a bounded inverse only if $X$ is
finite-dimensional.


Proof. Assume that $A$ has a bounded inverse $A^{-1} : Y \to X$. Then we have
$A^{-1}A = I$, and therefore the identity operator must be compact, since the
product of a bounded and a compact operator is compact (see Problem
12.2). However, the identity operator on $X$ is compact if and only if $X$ has
finite dimension. □


Theorem 12.25 implies that integral equations of the first kind with con- 
tinuous (or weakly singular) kernels are improperly posed problems in the 
sense of Hadamard, as described in Chapter 5. 

Of course, the ill-posed nature of an equation has consequences for its 
numerical treatment. The fact that an operator does not have a bounded 
inverse means that the condition numbers of its finite-dimensional approx- 
imations grow with the quality of the approximation. Hence, a careless dis- 
cretization of ill-posed problems leads to a numerical behavior that at first 
glance seems to be paradoxical. Namely, increasing the degree of discretiza- 
tion, i.e., increasing the accuracy of the approximation for the operator, will 
cause the approximate solution to the equation to become less and less re- 
liable. Therefore, straightforward application of the methods described in 
this chapter to integral equations of the first kind with continuous kernels 
will generate numerical nonsense. 

To make this remark more vivid, we consider the approximate solution
of an integral equation of the first kind

\[
\int_a^b K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],
\]

by the analogue of the linear system (12.14) for the Nyström method, i.e.,
by

\[
\sum_{k=0}^{n} a_k K(x_j, x_k)\,\varphi_k^{(n)} = f(x_j), \quad j = 0,\ldots,n.
\]

The equation of the first kind

\[
\int_0^1 (x+1)e^{-xy}\varphi(y)\,dy = 1 - e^{-(x+1)}, \quad 0 \le x \le 1, \tag{12.37}
\]

has the unique solution $\varphi(x) = e^{-x}$ (see Problem 12.20). Table 12.6 gives
the difference between the exact solution and the solution obtained by the
quadrature method using the (composite) trapezoidal rule.


TABLE 12.6. Numerical solution of (12.37) 


  0.4057     0.3705      0.1704
 -4.5989    14.6094     -4.4770


 -8.5957     2.2626   -153.4805
  3.8965   -32.2907     22.9970
-88.6474    -6.4484   -182.6745


We observe that the approximation is completely useless and that in 
agreement with the above remarks, the quality of the approximation de- 
creases when the accuracy of the quadrature is increased. (Of course, the 
actual numerical values for the solution of the ill-conditioned linear system 
of this example will depend on the actual computer and the code for solving 
the linear system that is used.) 
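An experiment of this kind can be sketched as follows. The code assembles the trapezoidal rule system for (12.37) with kernel $(x+1)e^{-xy}$ and right-hand side $1 - e^{-(x+1)}$, solves it by Gaussian elimination with partial pivoting, and prints the maximal error against $e^{-x}$. The grid sizes, the solver, and all function names are assumptions; as remarked above, the printed errors depend on the machine and the code and need not match Table 12.6.

```python
import math

def first_kind_system(n):
    """Trapezoidal rule discretization of (12.37) with n subintervals of [0, 1]."""
    h = 1.0 / n
    x = [j * h for j in range(n + 1)]
    w = [h / 2 if k in (0, n) else h for k in range(n + 1)]
    a = [[w[k] * (x[j] + 1.0) * math.exp(-x[j] * x[k]) for k in range(n + 1)]
         for j in range(n + 1)]
    b = [1.0 - math.exp(-(x[j] + 1.0)) for j in range(n + 1)]
    return x, a, b

def solve(a, b):
    """Gaussian elimination with partial pivoting and back substitution."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[piv][col] == 0.0:
            raise ValueError("matrix is numerically singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][c] * sol[c] for c in range(i + 1, n))
        sol[i] = s / m[i][i]
    return sol

def consistency_residual(n):
    """How well the exact solution exp(-x) satisfies the discrete system."""
    x, a, b = first_kind_system(n)
    exact = [math.exp(-t) for t in x]
    return max(abs(sum(a[j][k] * exact[k] for k in range(n + 1)) - b[j])
               for j in range(n + 1))

for n in (4, 8, 16):
    x, a, b = first_kind_system(n)
    try:
        phi = solve(a, b)
        err = max(abs(p - math.exp(-t)) for p, t in zip(phi, x))
        print(n, "residual", consistency_residual(n), "error", err)
    except (ValueError, OverflowError):
        print(n, "residual", consistency_residual(n), "elimination broke down")
```

The consistency residual decreases as the quadrature is refined, so the loss of accuracy in the computed solution comes from the ill-conditioning of the linear system and not from the quadrature error.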

Hence, the numerical solution of integral equations of the first kind with 
continuous kernels requires regularization methods such as Tikhonov reg- 
ularization or singular value cutoff, which we discussed in Chapter 5 for 
the finite-dimensional case. These regularization techniques now, of course, 
need to be analyzed in an appropriate function space setting. We recall the 
corresponding references to [14, 22, 28, 37, 39, 43] from Chapter 5 for the 
foundation of regularization methods in Hilbert spaces. 
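For the discrete system, the finite-dimensional Tikhonov regularization recalled from Chapter 5 can be sketched as follows: instead of $\widetilde{A}\Phi = b$ one solves $(\alpha I + \widetilde{A}^T \widetilde{A})\Phi_\alpha = \widetilde{A}^T b$ with a parameter $\alpha > 0$. The sketch applies this to the trapezoidal rule system for (12.37). Working in the Euclidean norm on the grid values, the parameter values, the grid size, and all function names are assumptions made for illustration; the function space setting mentioned above is not modeled.

```python
import math

def first_kind_system(n):
    """Trapezoidal rule discretization of (12.37) with n subintervals of [0, 1]."""
    h = 1.0 / n
    x = [j * h for j in range(n + 1)]
    w = [h / 2 if k in (0, n) else h for k in range(n + 1)]
    a = [[w[k] * (x[j] + 1.0) * math.exp(-x[j] * x[k]) for k in range(n + 1)]
         for j in range(n + 1)]
    b = [1.0 - math.exp(-(x[j] + 1.0)) for j in range(n + 1)]
    return x, a, b

def solve(a, b):
    """Gaussian elimination with partial pivoting and back substitution."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        if m[piv][col] == 0.0:
            raise ValueError("matrix is numerically singular")
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    sol = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = m[i][n] - sum(m[i][c] * sol[c] for c in range(i + 1, n))
        sol[i] = s / m[i][i]
    return sol

def tikhonov(a, b, alpha):
    """Solve the regularized normal equations (alpha*I + A^T A) phi = A^T b."""
    n = len(b)
    ata = [[sum(a[i][r] * a[i][c] for i in range(n)) + (alpha if r == c else 0.0)
            for c in range(n)] for r in range(n)]
    atb = [sum(a[i][r] * b[i] for i in range(n)) for r in range(n)]
    return solve(ata, atb)

def norm2(v):
    return math.sqrt(sum(t * t for t in v))

x, a, b = first_kind_system(8)
solutions = {alpha: tikhonov(a, b, alpha) for alpha in (1e-2, 1e-4)}
for alpha, phi in solutions.items():
    err = max(abs(p - math.exp(-t)) for p, t in zip(phi, x))
    print("alpha =", alpha, "norm =", norm2(phi), "max error =", err)
```

The regularized solutions obey $\|\Phi_\alpha\|_2 \le \|b\|_2 / (2\sqrt{\alpha})$, so they stay bounded where the unregularized system produces wildly oscillating values; how to choose $\alpha$ is a separate question that this sketch does not address.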


Problems 


12.1 Show that the boundary value problem for the differential equation

\[
-u'' + qu = r \quad \text{in } [0,1]
\]

with boundary conditions $u(0) = u(1) = 0$ is equivalent to finding a continuous
solution of the integral equation of the second kind

\[
u(x) + \int_0^1 G(x,y)q(y)u(y)\,dy = \int_0^1 G(x,y)r(y)\,dy, \quad x \in [0,1],
\]

where

\[
G(x,y) := \begin{cases} (1-x)y, & 0 \le y \le x \le 1, \\ (1-y)x, & 0 \le x \le y \le 1, \end{cases}
\]

is the so-called Green's function of the boundary value problem.


12.2 Show that linear combinations of compact linear operators are compact 
and that the product of two bounded linear operators is compact if one of the 
factors is compact. 


12.3 Show that the integral operator with continuous kernel is a compact
operator from $L^2[a,b]$ into $L^2[a,b]$.


12.4 Show that the Volterra integral equation of the second kind

\[
\varphi(x) - \int_a^x K(x,y)\varphi(y)\,dy = f(x), \quad x \in [a,b],
\]

with continuous kernel $K$ has a unique continuous solution $\varphi$ for each continuous
right-hand side $f$.
Hint: Show that the homogeneous equation allows only the trivial solution and
use Theorem 12.2.


12.5 Solve the Volterra integral equation

\[
\varphi(x) - \int_0^x e^{x-y}\varphi(y)\,dy = f(x)
\]

by successive approximations.


12.6 Show that a sequence $A_n : X \to Y$ of compact linear operators mapping
a normed space $X$ into a normed space $Y$ is collectively compact if and only if
for each bounded sequence $(\varphi_n)$ in $X$ the sequence $(A_n\varphi_n)$ contains a convergent
subsequence.


12.7 Show that a sequence $(\varphi_n)$ of functions $\varphi_n : [a,b] \to \mathbb{R}$ that is
equicontinuous and converges pointwise on $[a,b]$ to some function $\varphi : [a,b] \to \mathbb{R}$ converges
uniformly on $[a,b]$.


12.8 Prove the Banach-Steinhaus theorem: Let $A : X \to Y$ be a bounded linear
operator and let $A_n : X \to Y$ be a sequence of bounded linear operators from a
Banach space $X$ into a normed space $Y$. For pointwise convergence $A_n\varphi \to A\varphi$,
$n \to \infty$, for all $\varphi \in X$ it is necessary and sufficient that $\|A_n\| \le C$ for all $n \in \mathbb{N}$
with some constant $C$ and that $A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in U$, where $U$ is
some dense subset of $X$ (compare Theorem 9.10).


12.9 For the integral operator $A$ and the numerical integration operators using
the (composite) trapezoidal rule, derive bounds on $\|(A_n - A)A\|_\infty$ and
$\|(A_n - A)A_n\|_\infty$. Relate the results to Lemma 12.9.


12.10 Verify the integrals (12.21). 
12.11 Write a computer program for the Nyström method allowing the use of
different quadrature formulae and test it for various examples.


12.12 Use the quadrature formula (9.36) with the substitution (9.47) in a
Nyström method for the integral equation (12.19). Compare the numerical
results with those obtained from the trapezoidal and Simpson's rule.


12.13 Verify the Lagrange basis (12.35) and the integrals (12.36). 


12.14 Show that the Fourier series of a continuously differentiable periodic func- 
tion is uniformly convergent. 


12.15 In the degenerate kernel approximation the integral equation of the
second kind with continuous kernel $K$ is approximated by the solutions of

\[
\varphi_n(x) - \int_a^b K_n(x,y)\varphi_n(y)\,dy = f(x), \quad x \in [a,b],
\]

with an approximate kernel $K_n$ of the form

\[
K_n(x,y) = \sum_{j=0}^{n} a_j(x)b_j(y).
\]

Show how the solution of the approximate equation can be reduced to solving
a system of linear equations. Give an error and convergence analysis based on
Theorem 12.6.


12.16 Use the results of Problem 12.15 to prove Theorem 12.2 for the case of 
an integral equation of the second kind with continuous kernel. 


12.17 Construct degenerate kernels via interpolation of the kernel K with re- 
spect to the first variable and relate this particular degenerate kernel method to 
the collocation method (see Problem 12.15). 


12.18 The idea of two-grid and multigrid iterations can also be applied to
integral equations of the second kind. For its theoretical foundation assume
the sequence of operators $A_n : X \to X$ to be either norm convergent (i.e.,
$\|A_n - A\| \to 0$, $n \to \infty$) or collectively compact and pointwise convergent (i.e.,
$A_n\varphi \to A\varphi$, $n \to \infty$, for all $\varphi \in X$). Show that the defect correction iteration

\[
\varphi_{n,\nu+1} := (I - A_{n-1})^{-1}\{(A_n - A_{n-1})\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,
\]

using the preceding coarser level converges, provided that $n$ is sufficiently large.
Show that the defect correction iteration

\[
\varphi_{n,\nu+1} := (I - A_0)^{-1}\{(A_n - A_0)\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,
\]

using the coarsest level converges, provided that the approximation $A_0$ is
sufficiently close to $A$.
12.19 Consider the two-grid iteration

\[
\varphi_{n,\nu+1} := (I - A_m)^{-1}\{(A_n - A_m)\varphi_{n,\nu} + f_n\}, \quad \nu = 0,1,2,\ldots,
\]

with $m = n-1$ or $m = 0$ for the Nyström method, i.e., for the numerical
quadrature operators

\[
(A_n\varphi)(x) = \sum_{k=1}^{z_n} a_k^{(n)} K(x, x_k^{(n)})\,\varphi(x_k^{(n)}), \quad x \in [0,1],
\]

with $z_n$ quadrature points. Show that each iteration step requires the following
computations. First

\[
g_{n,\nu} := f_n + (A_n - A_m)\varphi_{n,\nu}
\]

has to be evaluated at the $z_m$ quadrature points $x_j^{(m)}$, $j = 1,\ldots,z_m$, on the level
$m$ and at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1,\ldots,z_n$, on the level $n$ by setting
$x = x_j^{(m)}$ and $x = x_j^{(n)}$, respectively, in

\[
g_{n,\nu}(x) = f_n(x) + \sum_{k=1}^{z_n} a_k^{(n)} K(x, x_k^{(n)})\,\varphi_{n,\nu}(x_k^{(n)}) - \sum_{k=1}^{z_m} a_k^{(m)} K(x, x_k^{(m)})\,\varphi_{n,\nu}(x_k^{(m)}).
\]

Then one has to solve the linear system

\[
\varphi_{n,\nu+1}(x_j^{(m)}) - \sum_{k=1}^{z_m} a_k^{(m)} K(x_j^{(m)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) = g_{n,\nu}(x_j^{(m)}), \quad j = 1,\ldots,z_m,
\]

for the values $\varphi_{n,\nu+1}(x_j^{(m)})$ at the $z_m$ quadrature points $x_j^{(m)}$. Finally, the values
at the $z_n$ quadrature points $x_j^{(n)}$, $j = 1,\ldots,z_n$, are obtained from the Nyström
interpolation

\[
\varphi_{n,\nu+1}(x_j^{(n)}) = \sum_{k=1}^{z_m} a_k^{(m)} K(x_j^{(n)}, x_k^{(m)})\,\varphi_{n,\nu+1}(x_k^{(m)}) + g_{n,\nu}(x_j^{(n)}), \quad j = 1,\ldots,z_n.
\]

Make an operation count for one step of the defect correction iteration. Set up 
the corresponding equations for the collocation method. 


12.20 Show that the integral equation (12.37) has a unique solution. 